Pull high-quality, efficient embeddings for PubMed, arXiv and Wikipedia from Hugging Face and use them for local LLM inference/Retrieval Augmented Generation (RAG)

somewheresystems/dataclysm

@@@@@%%%%%@#-..:-*%%@@@@@@#********%@@@@%%%%#*++#@#*#@%%@@@%%%%%%%%%%%%%%%%+..:*%#%%%%@@@@
@@@@@%%%%##@@=-*=..+%#+::+@-:-:::.*#-:-%%%*:.---.@*.#%@#:.%*.:#-.---=%--%@*..-%@##%%%%@@@@
@@@@@%%%%#*%@+-#@#::*@*---@*@@-+@%*%:=.#%+:=%@@@#@#:#%@@%=:.-@#:-%@@*%=:+%-.-=@#*#%%%%@@@@
@@@@@%%%%##%@+-#@%=-*@-+*.#@@@-+@@@+-%-:@=-*@@@@@@#:#@@@@%-:@@@%+-=#@%=-:=.+=+@#*#%%%%@@@@
@@@@@%%%%##%@*+%@#=+%+-==-+@@@=+@@%-===-*#==%@@@%@#=#@@@%%=+@@%@@@*-*@+**-=%+*@###%%%%@@@@
@@@@@%%%%##@@#+#=-*%%=*@@#=%@@=+@@+=%@@+=@%+:-===@#-===-#@++@@#-++-:*@+#@+%@*#@@*#%%%%@@@@
@@@@@%%%%##%@++*%@@@#%#@@#@#@%@%%%%%%@%%%%@@@%%@@@#@@@@@@%%%%@#@%#%@@%%#@@@%*+@%*#%%%%@@@@

This repository provides a guide to getting started with DATACLYSM: a series of high-quality embedding libraries covering the entirety of PubMed, English Wikipedia and arXiv. The guide is based on the getting_started.ipynb notebook.

It also includes a demo of the Spatial Search Engine, a Streamlit app for exploring the Dataclysm datasets visually and performing ranked searches on proximally related articles (by title, currently).

Table of Contents

  1. Installation
  2. Initialization
  3. Retrieval Augmented Generation
  4. Reranking Results
  5. License

Installation

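To install the necessary dependencies, run the following command in a fresh conda environment (Python 3.10 is suggested). The %pip prefix is a Jupyter magic for use inside the notebook; from a terminal, drop the leading %: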

%pip install -r requirements.txt

Retrieval Augmented Generation

The Retrieval Augmented Generation (RAG) demonstration uses the BAAI/bge_small_en_v2 model to encode a query and retrieve examples based on title similarity using FAISS. The examples are then summarized using Hermes-2.5-Mistral-7B.
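The retrieval step described above can be sketched without any model or index downloads. The following is only an illustration under stated assumptions: a brute-force NumPy inner-product search stands in for the FAISS index, the 3-dimensional vectors are made-up toy values rather than real model embeddings, and the function names (l2_normalize, search_titles) are hypothetical and not taken from the notebook.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def search_titles(query_vec: np.ndarray, title_vecs: np.ndarray, k: int = 2):
    """Return (row index, cosine score) for the k titles closest to the query.

    Brute-force stand-in for a FAISS inner-product search (assumption).
    """
    q = l2_normalize(query_vec.reshape(1, -1))
    t = l2_normalize(title_vecs)
    scores = (t @ q.T).ravel()
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy data: hypothetical titles with hand-written 3-d "embeddings".
titles = [
    "Protein folding with deep learning",
    "Galaxy formation surveys",
    "Neural networks for protein structure",
]
title_vecs = np.array(
    [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]], dtype=np.float32
)
query_vec = np.array([1.0, 0.2, 0.0], dtype=np.float32)

hits = search_titles(query_vec, title_vecs, k=2)
for idx, score in hits:
    print(f"{score:.3f}  {titles[idx]}")
```

In a real run, the query vector would come from the embedding model and the title vectors from the Dataclysm dataset; the retrieved rows would then be passed to the LLM for summarization.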

Reranking Results

Demos are included for classical (score augmentation) and LLM-based (experimental) reranking of results. The experimental LLM reranking process prompts the aforementioned model to rerank the retrieved results, drop irrelevant ones, and return the outcome as a list. The results are then displayed as a table with hyperlinks.
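As a rough illustration of what classical score augmentation can look like, the sketch below adds a secondary signal to each similarity score and re-sorts. The choice of signal (keyword overlap between query and title), the weight of 0.5, the function names, and the sample scores are all assumptions made for this example; the notebook's actual augmentation may differ.

```python
def keyword_overlap(query: str, title: str) -> float:
    """Fraction of query words that also appear in the title (case-insensitive)."""
    query_words = set(query.lower().split())
    title_words = set(title.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & title_words) / len(query_words)


def rerank(query: str, hits, weight: float = 0.5):
    """Add a weighted keyword-overlap bonus to each score and sort descending.

    `hits` is a list of (title, similarity) pairs; both the bonus and the
    weight are hypothetical choices for illustration.
    """
    augmented = [
        (title, sim + weight * keyword_overlap(query, title)) for title, sim in hits
    ]
    return sorted(augmented, key=lambda pair: pair[1], reverse=True)


# Made-up retrieval output: (title, similarity score).
hits = [
    ("Galaxy formation surveys", 0.62),
    ("Protein structure prediction", 0.60),
    ("Deep learning for protein folding", 0.58),
]

reranked = rerank("protein folding", hits)
for title, score in reranked:
    print(f"{score:.2f}  {title}")
```

Here the title matching both query words moves to the top even though its raw similarity was lowest, which is the kind of correction a score-augmentation pass is meant to provide.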

Streamlit SSE (Spatial Search Engine) Demo

To run the Streamlit demo, navigate to the streamlit-demo directory and run the Streamlit app:

cd streamlit-demo
streamlit run app.py

License

This project is licensed under the Apache License 2.0. For more details, see the LICENSE file in the repository.

For more detailed instructions and examples, refer to the getting_started.ipynb notebook.
