French-Web-Domain-Classification-Using-Text-and-Graph-Data

Web domain classification is a major research topic in information retrieval. Most studies claim that a domain's category is strongly correlated with its content, but that exploiting this correlation requires well-chosen descriptors. In this paper, we explore both text and graph data to build a classification model for French web domains.

Input Data

Our input dataset consists of a directed graph, a collection of texts, and labels.

graph_data (.txt file): a directed graph with 28002 vertices and 319498 weighted directed edges. Nodes correspond to domains, and each edge weight is the total number of hyperlinks connecting the two domains.
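
The layout of graph_data.txt is not described here, so the following is only a minimal sketch under an assumption: one edge per line, written as whitespace-separated `source target weight`. The function names `parse_weighted_edges` and `weighted_degrees` are hypothetical and do not come from the repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_weighted_edges(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Build an adjacency map {source: {target: weight}} from edge-list lines."""
    adjacency: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        source, target = parts[0], parts[1]
        # Assumption: a missing third column means a single hyperlink.
        weight = float(parts[2]) if len(parts) > 2 else 1.0
        adjacency[source][target] = adjacency[source].get(target, 0.0) + weight
        adjacency.setdefault(target, {})  # register nodes that only receive links
    return dict(adjacency)


def weighted_degrees(
    adjacency: Dict[str, Dict[str, float]]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (in_degree, out_degree), both weighted by hyperlink counts."""
    out_degree = {node: sum(nbrs.values()) for node, nbrs in adjacency.items()}
    in_degree = {node: 0.0 for node in adjacency}
    for nbrs in adjacency.values():
        for target, weight in nbrs.items():
            in_degree[target] += weight
    return in_degree, out_degree


# Toy edge list standing in for the real file (made-up values).
sample_lines = ["0 1 3", "1 2 1", "0 2 2"]
graph = parse_weighted_edges(sample_lines)
in_deg, out_deg = weighted_degrees(graph)
```

With the real file, the same function could be fed an open file handle, for example `parse_weighted_edges(open("graph_data.txt", encoding="utf-8"))`, provided the assumed layout holds.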

text_data (folder): each file in the folder contains the combined text of all the web pages of the corresponding domain. In total, 2554 French domains have text data.
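
A possible way to read such a folder is sketched below. It assumes that each file name identifies its domain and that the files are UTF-8 encoded; neither detail is stated above. The helper `load_domain_texts` is a hypothetical name, and the demo uses a temporary folder with made-up files rather than the real text_data.

```python
import tempfile
from pathlib import Path
from typing import Dict


def load_domain_texts(folder: str) -> Dict[str, str]:
    """Map each file name in `folder` to the full text stored in that file."""
    texts: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # errors="replace" keeps loading even if a page has bad bytes.
            texts[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return texts


# Demo on a throwaway folder with two made-up domain files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "12").write_text("actualités sportives", encoding="utf-8")
    Path(tmp, "47").write_text("cours de la bourse", encoding="utf-8")
    demo_texts = load_domain_texts(tmp)
```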

train_labels and test_labels: these contain the indices of the domains along with their labels. There are 8 labels in total: business/finance, entertainment, tech/science, education/research, politics/government/law, health/medical, news/press and sports.

Code

The notebook reproduces all the plots and results in the paper. We also include two Python files, 'text_baseline.py' and 'graph_baseline.py', which serve as starting points for readers who want to implement their own models and run further experiments and validation.
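
To give a feel for what a text-only starting point can look like, here is a small multinomial naive Bayes classifier written with the standard library only. It is an illustrative stand-in chosen for brevity, not the contents of 'text_baseline.py' and not the method used in the paper. The class `NaiveBayesText`, the `tokenize` helper, and the toy training sentences are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from typing import List, Sequence

TOKEN = re.compile(r"\w+")  # Unicode-aware in Python 3, so accents are kept


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


class NaiveBayesText:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> "NaiveBayesText":
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_words = {
            c: sum(self.word_counts[c].values()) for c in self.class_counts
        }
        return self

    def predict(self, text: str) -> str:
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        # Tokens never seen in training carry no information and are dropped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / n_docs)  # log prior
            denom = self.total_words[label] + vocab_size
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (made-up sentences, two of the eight labels).
train_texts = [
    "match football équipe victoire",
    "banque bourse investissement",
    "football championnat but",
    "bourse action marché",
]
train_labels = ["sports", "business/finance", "sports", "business/finance"]

model = NaiveBayesText().fit(train_texts, train_labels)
prediction = model.predict("équipe football but")
```

On real data, the training texts would come from text_data and the labels from train_labels; a stronger model would likely replace raw counts with weighted features and combine them with graph-derived features.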

