Skip to content

This project is a Retrieval-Augmented Generation (RAG) Tool designed to help users better understand scientific studies and avoid misinformation propagated by clickbait or hyperbolic posts.

Notifications You must be signed in to change notification settings

dapperAuteur/kys-rag

Repository files navigation

Project Description

This project is a Retrieval-Augmented Generation (RAG) Tool designed to help users better understand scientific studies and avoid misinformation propagated by clickbait or hyperbolic posts. The tool uses free and open-source technologies to provide insights, detect misinformation, and promote critical thinking when analyzing scientific claims.

Key features include:

  • NLP with Hugging Face SciBERT to understand scientific text and communicate effectively with users from diverse socioeconomic backgrounds.
  • FAISS for fast vector similarity search, ensuring quick retrieval of relevant study chunks.
  • MongoDB for persistent storage of metadata, enabling advanced indexing by topic, discipline, and other dimensions.
  • Python FastAPI to power backend services with minimal overhead and fast execution.
  • Next.js for the frontend, enabling an interactive, SEO-friendly interface that supports RAG-based chat, multi-user support, and more.

The tool includes features like chunking, misinformation detection, bias awareness, and user feedback to iteratively improve accuracy and relevance. It supports multiple users, user invitations, and contribution forms for submitting scientific studies and articles for analysis.

The project emphasizes accessibility, transparency, and critical thinking by encouraging users to cross-reference multiple studies and learn from diverse perspectives.


README.md

# Retrieval-Augmented Generation (RAG) Tool for Scientific Literacy

## Introduction
This RAG tool is designed to combat misinformation by providing a clear and unbiased understanding of scientific studies. It leverages state-of-the-art Natural Language Processing (NLP) and information retrieval techniques to break down complex research papers and articles into accessible insights. The tool fosters critical thinking and transparency in science communication, making it valuable for individuals, researchers, and organizations aiming to understand the facts behind the headlines.

## Features
### 1. **Core Functionalities**
- **RAG-Based Chat**:
  - Engages users in meaningful conversations about scientific studies.
  - Allows users to **persist conversations** and invite others to participate.
  - Multi-user support with admin privileges to manage access.
- **Misinformation Detection**:
  - Identifies and flags misleading claims in scientific articles.
  - Provides context-aware summaries to highlight biases or gaps.
- **Bias Awareness**:
  - Informs users about potential biases in studies or claims.
- **User Feedback Loop**:
  - Users can rate responses for relevance, accuracy, and helpfulness to iteratively improve the system.

### 2. **Data Ingestion**
- **Article Submission**:
  - Submit article/blog links for review.
  - Cross-check citations to ensure alignment with claims.
- **Scientific Study Submission**:
  - Upload PDFs or provide links to studies.
  - Metadata stored for retrieval (e.g., topic, discipline).
- **Web Scraping**:
  - Uses BeautifulSoup to scrape and ingest content from websites.

### 3. **Text Processing**
- **Chunking**:
  - Breaks studies into smaller chunks for embedding.
  - Uses overlapping windows to maintain context during retrieval.
- **Indexing**:
  - FAISS ensures fast vector search for chunked data.
  - MongoDB stores metadata with indexing by topic, discipline, and other dimensions.

### 4. **Feedback and Continuous Improvement**
- Captures common user queries to improve FAQ-like responses.
- Encourages users to cross-reference studies for a balanced understanding.

---

## Technical Stack
### **Frontend**
- **Next.js**: Provides a modern, SEO-optimized, and interactive interface for:
  - Chat-based interactions.
  - User forms for invitations and submissions.
  - Deployment on **Vercel (free tier)** with a custom domain.

### **Backend**
- **Python FastAPI**: Powers the backend with:
  - API routes for interacting with FAISS and MongoDB.
  - Minimal overhead, fast execution, and modern API design.
- **BeautifulSoup**: Scrapes content from submitted links.

### **Data Storage and Retrieval**
- **FAISS**: Ensures high-speed vector similarity search for study embeddings.
- **MongoDB**: Stores metadata for persistent indexing and search.

### **NLP**
- **Hugging Face SciBERT**:
  - Pretrained on biomedical and scientific text.
  - Accessible for users with diverse backgrounds and educational levels.

---

## Deployment
### **Frontend**
- Hosted on **Vercel** (free tier) with a custom domain.
- Provides global content delivery, automatic SSL, and seamless integration with Next.js.

### **Backend**
- Options for hosting:
  1. **Render (free tier)**:
     - Ideal for Python-based FastAPI services.
  2. **Fly.io**:
     - Supports free-tier deployments with scalability options.
  3. **Deta**:
     - Free for lightweight FastAPI apps.

---

## Strengths of the Chosen Approach
### **1. Scalability and Speed**
- **FAISS**: Handles large datasets quickly, ensuring a smooth user experience.
- **MongoDB**: Persistent metadata allows efficient indexing and retrieval.

### **2. Accessibility**
- **Hugging Face SciBERT**: Tailored for scientific language, breaking down barriers for users from diverse backgrounds.

### **3. Developer Productivity**
- **Next.js**: Simplifies frontend development with built-in SSR and API routes.
- **FastAPI**: Reduces backend complexity with its modern design and fast execution.

### **4. User Engagement**
- Multi-user chat capabilities and forms for contributions foster collaboration and community building.

---

## Weaknesses of the Approach
### **1. Resource Limitations**
- **FAISS and MongoDB**: Combining these adds complexity, especially in synchronization and scalability.
- **Free Hosting**: Limited resources on free-tier platforms may impact performance during high usage.

### **2. Learning Curve**
- Next.js and FastAPI have slight learning curves, particularly for developers unfamiliar with SSR/modern APIs.

### **3. NLP Model Limitations**
- Hugging Face models like SciBERT might require fine-tuning for optimal performance with specific data.

---

## Getting Started
### **1. Clone the Repository**
```bash
git clone https://github.com/your-username/scientific-rag-tool.git
cd scientific-rag-tool

2. Install Dependencies

Backend:

pip install -r requirements.txt

Frontend:

npm install

3. Run the Application

  • Start the FastAPI backend:
uvicorn app.main:app --reload
  • Start the Next.js frontend:
npm run dev

4. Deploy

Follow deployment instructions for Vercel (frontend) and Render/Fly.io (backend).


Contributing

Contributions are welcome! Please open an issue or submit a pull request for improvements.


Contact

For questions or inquiries, reach out at [email protected]. Chat with Brand Anthony McDonald in real-time by visiting https://i.brandanthonymcdonald.com/portfolio


This README.md provides a comprehensive overview of the project, technical stack, features, deployment, and strengths/weaknesses for potential contributors and employers. Let me know if you'd like to refine or expand any sections!

Made with ❤️ by [BAM](https://brandanthonymcdonald.com/)

About

This project is a Retrieval-Augmented Generation (RAG) Tool designed to help users better understand scientific studies and avoid misinformation propagated by clickbait or hyperbolic posts.

Resources

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

Packages

No packages published