
ArXiv Atlas


This project powers the ArXiv Atlas, an interactive visualization that lets users explore research papers and their relationships. It also provides RAG-style recommendations based on user queries.


Project Overview

The ArXiv Atlas is a web-based tool that provides an interactive map of research papers from the ArXiv repository. Users can navigate through different categories, discover connections between papers, and get recommendations based on their interests.

Features

  • Interactive Map: Explore an interactive map of ArXiv papers created using LLM embeddings of the papers.
  • Search Functionality: Search for specific papers, or use semantic search to find relevant papers from free-text queries.
  • Visualization: View relationships and connections between papers based purely on their content and the resulting embeddings.

How does the Recommendation System work?

The recommendation system relies on embeddings of the papers. These embeddings are vector representations of the abstracts and titles, created using a BERT-type language model. While it's possible to use GPT-type language models for contextualized embeddings, they typically require more computational resources.
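
As a minimal sketch of this step, here is how abstracts could be embedded with the sentence-transformers library. The model name is a small, well-known stand-in, not the exact model used for the Atlas:

```python
from sentence_transformers import SentenceTransformer

# Any BERT-type sentence-embedding model works here; "all-MiniLM-L6-v2"
# is just a compact, widely used stand-in.
model = SentenceTransformer("all-MiniLM-L6-v2")

papers = [
    "GPT-4 Technical Report. We report the development of GPT-4, ...",
    "Attention Is All You Need. The dominant sequence transduction models ...",
]

# One vector per paper, built from title + abstract.
embeddings = model.encode(papers, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for this particular model
```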

Embeddings capture the semantic meaning of the papers, enabling the calculation of similarity between papers based on their embeddings. The similarity is measured using cosine similarity, which compares the angles between the vectors representing the papers. A higher cosine similarity score indicates a greater semantic similarity between two papers.
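
In code, cosine similarity is a one-liner; for vectors that are already L2-normalized it reduces to a plain dot product:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (|a| * |b|), in [-1, 1] for real vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With the normalized embeddings from the sketch above, this is just a dot product.
print(cosine_similarity(embeddings[0], embeddings[1]))
```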

I calculated the similarity between all¹ papers in the dataset and stored the results in a similarity matrix. When a user selects a paper, the system uses this similarity matrix to retrieve and recommend the most similar papers.
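
A minimal sketch of that lookup, assuming a matrix of L2-normalized embeddings (at 1.2 million papers the full matrix is of course far too large to keep around, see the footnotes):

```python
import numpy as np

# For L2-normalized embeddings, the cosine-similarity matrix is one matmul.
similarity = embeddings @ embeddings.T  # (n_papers, n_papers)

def recommend(paper_idx: int, k: int = 5) -> np.ndarray:
    """Indices of the k papers most similar to paper_idx."""
    scores = similarity[paper_idx].copy()
    scores[paper_idx] = -np.inf          # never recommend the query paper itself
    return np.argsort(scores)[::-1][:k]  # highest similarity first
```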

Additionally, the system can provide recommendations based on a semantic query, such as "What are the challenges when using GAN-based classification methods?". The query is embedded, the system retrieves the papers whose embeddings are closest to it, and the results are then improved by reranking them with a dedicated reranking model.
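
Sketched with the same toy setup as above (reusing `model`, `papers`, and `embeddings`); the cross-encoder name is a common public reranker used as a stand-in, not necessarily the model behind the Atlas:

```python
import numpy as np
from sentence_transformers import CrossEncoder

query = "What are the challenges when using GAN-based classification methods?"
query_emb = model.encode(query, normalize_embeddings=True)

# Stage 1: cheap vector retrieval of candidate papers.
candidates = np.argsort(embeddings @ query_emb)[::-1][:50]

# Stage 2: rerank the candidates with a cross-encoder, which reads the
# query and each abstract together and scores the pair directly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, papers[i]) for i in candidates])
top_5 = candidates[np.argsort(scores)[::-1][:5]]
```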

What model was used for the embeddings?

For internal testing, I used the AnglE-optimized Text Embeddings² model with some fine-tuning on ArXiv data. A good library for fine-tuning is Ragas.

But for the ArXiv Atlas I'm using the OpenAI text-embedding-3-large model. It offers a good balance between performance and cost, and means I don't have to run a GPU on my server 24/7 to serve requests. The model also has the very nice property of being trained with Matryoshka Representation Learning! This means it compresses most of the information into its first 256 dimensions³, so I can discard the remaining dimensions, saving a lot of memory and keeping lookups extremely fast.
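
The OpenAI API exposes this truncation directly via the dimensions parameter; roughly like this:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=["We report the development of GPT-4, a large-scale ..."],
    dimensions=256,  # Matryoshka property: keep only the leading dimensions
)

# Renormalize defensively so cosine similarity stays a plain dot product,
# then store as float16 to match the dataset format described below.
emb = np.array(resp.data[0].embedding, dtype=np.float32)
emb /= np.linalg.norm(emb)
emb = emb.astype(np.float16)
```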

How does the visualization work?

Since the embedding vectors are typically 256 to 3072 dimensional and we are 3-dimensional beings, we need to somehow project the embeddings down to 2 or 3 dimensions to visualize them. There are several ways to do this: random projections, PCA, t-SNE, PaCMAP, and so on.
I'm using UMAP for this task because it's very performant, gives good results, and I'm a bit biased towards it. Under the hood I also use an autoencoder to compute the UMAP projections with a parametric approach. It's not in this repo because the code is messy, but I may publish it in the future.
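
The parametric variant isn't published yet, but the standard umap-learn API shows the idea; the parameters here are illustrative defaults, not the exact Atlas settings:

```python
import umap  # pip install umap-learn

# embeddings: (n_papers, dim) array of paper embeddings.
# Project them down to 2D for the map.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine")
coords = reducer.fit_transform(embeddings)  # (n_papers, 2)
x_umap, y_umap = coords[:, 0], coords[:, 1]
```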

If you are generally interested in how these methods work, there is a nice paper that explains a lot of them, but beware: it's biased towards PaCMAP.

Can I get the Data?

Sure, but I can't provide the similarity matrix because it's way too big.
I can, however, provide a script to calculate the N closest papers to a given paper (see the sketch below).
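
Something along these lines, assuming the DataFrame format described below; it avoids the full similarity matrix by computing one matrix-vector product on the fly:

```python
import numpy as np
import pandas as pd

def n_closest(df: pd.DataFrame, arxiv_id: str, n: int = 10) -> pd.DataFrame:
    """Return the n papers most similar to the given paper, computed on the fly."""
    embs = np.stack(df["abstract_embedding"].to_numpy()).astype(np.float32)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)

    pos = int(np.flatnonzero(df["arxiv_id"].to_numpy() == arxiv_id)[0])
    scores = embs @ embs[pos]   # one matrix-vector product, no full matrix
    scores[pos] = -np.inf       # exclude the query paper itself
    order = np.argsort(scores)[::-1][:n]
    return df.iloc[order].assign(similarity=scores[order])
```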

| Name | Number of Papers | Size | Last Updated | Link |
| --- | --- | --- | --- | --- |
| Quantum Physics | 68.548 | 59.3 MB | 21. July 2024 | Download |
| High Energy Physics | 114.218 | 99.4 MB | 21. July 2024 | Download |
| Physics | 134.741 | 126 MB | 21. July 2024 | Download |
| Astro Physics | 160.252 | 169 MB | 21. July 2024 | Download |
| Condensed Matter | 171.503 | 155 MB | 21. July 2024 | Download |
| Computer Science | 485.772 | 452 MB | 21. July 2024 | Download |
| Combined | 1.238.980 | 1.12 GB | 21. July 2024 | Download |

Data Format

The data is stored in a Pandas DataFrame and saved as a Pickle file. The DataFrame has the following columns:

| Column Name | Type | Description | Example |
| --- | --- | --- | --- |
| title | str | Title of the paper | "GPT-4 Technical Report" |
| arxiv_id | str | ArXiv ID of the paper | "2303.08774" |
| abstract | str | Abstract of the paper | "We report the development of GPT-4, a large-sc.." |
| main_category | str | Main category of the paper | "cs.CL" |
| categories | list | Categories of the paper | ["cs.CL", "cs.AI"] |
| revision | str | Revision of the paper | "6" |
| published | datetime | Date of publication | "2023-03-15 17:15:04" |
| updated | datetime | Date of last update | "2024-03-04 06:01:33" |
| authors | list | Authors of the paper | ["OpenAI", "J. Achiam", "S. Adler", ...] |
| journal_ref | str | Journal reference | "J. Mach. Learn. Res. 22 (2021) 1-21" or "<NA>" |
| doi | str | DOI of the paper | "10.1234/5678" or "<NA>" |
| arxiv_comment | str | ArXiv comment | "Submitted to ICLR 2023" or "<NA>" |
| arxiv_DOI | str | ArXiv DOI of the paper | "10.1234/5678" or "<NA>" |
| abstract_embedding | np.ndarray (np.float16) | Embedding of the abstract (256 dimensions), computed with the text-embedding-3-large model | [-0.011314, -0.0605, -0.02097, -0.004242, ...] |
| arxiv_year | int32 | The first part of the ArXiv ID, used for sorting | 2303 |
| arxiv_number | int32 | The second part of the ArXiv ID, used for sorting | 8774 |
| x_umap | float32 | UMAP projection of the embedding in the x dimension (not available in the Combined dataset) | 0.1234 |
| y_umap | float32 | UMAP projection of the embedding in the y dimension (not available in the Combined dataset) | 0.5678 |
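
Loading a dataset is a one-liner with pandas; the file name below is just an example for whichever dataset you downloaded:

```python
import pandas as pd

df = pd.read_pickle("quantum_physics.pkl")  # example file name

print(df.shape)
print(df[["title", "arxiv_id", "main_category"]].head())
```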

Getting Started

The project comprises a frontend and a backend component.

The frontend is built using PixiJS for rendering, D3 for data manipulation, and plain JavaScript for the rest. To learn more about the frontend, check out the frontend README.

The backend is built using Python and FastAPI for serving the data and handling search requests. To learn more about the backend, check out the backend README.
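
For a feel of the shape such an endpoint can take (not the actual backend code, just a minimal sketch reusing `df` and the `n_closest` helper from the sketches above; the route path is illustrative):

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/similar/{arxiv_id}")
def similar(arxiv_id: str, n: int = 10):
    """Return the n most similar papers for a given arXiv ID."""
    recs = n_closest(df, arxiv_id, n)
    return [
        {"arxiv_id": row.arxiv_id, "title": row.title, "similarity": float(row.similarity)}
        for row in recs.itertuples()
    ]
```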

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments

Special thanks to the ArXiv team for maintaining the repository and providing the data for this project.
This project is not affiliated with arXiv.org.

Contact

This project is currently maintained by Jan-Lucas Uslu.
Feel free to reach out with any questions or feedback.
If you like the project, feel free to star it ⭐.

Demo

Check out the live demo at atlas.uslu.tech.

Footnotes

  1. My full dataset contains about 1.2 million papers, meaning roughly 1.2 million² ≈ 1.4 trillion dot products (about half that if you exploit the symmetry of the matrix)! My graphics card was fuming during the calculation of the similarity matrix.

  2. The AnglE-optimized Text Embeddings model is also based on BERT, and the paper is an interesting read for anyone interested in embeddings. You can also find it in the ArXiv Atlas. 🧠

  3. The AnglE model produces 1024-dimensional embeddings. This means the embeddings are about 4 times bigger than the truncated OpenAI embeddings, so about 6 GB of RAM are gone when serving them.
