
Climate RAG logo

Climate RAG: Retrieval-augmented generation pipeline to identify, scrape and ingest web documents to answer climate-related queries

Features

  • Supports multiple languages and auto-translates queries into other languages to run the optimal search
  • Supports multiple search engines (e.g. Baidu works better than Google for Chinese documents)
  • Extracts information from websites, PDFs (including scanned PDFs!), local text files and other sources
  • Modular design based on LangChain and LangGraph supports multiple industry-standard LLMs, including OpenAI GPT-4o, Google Gemini, Anthropic Claude, Meta Llama 3.1 and local LLMs via Ollama
  • User-friendly proof-of-concept UI
  • Command-line tools and a Python API to enable easy automation
  • RAG process means the LLM (almost) always cites the correct sources

Climate RAG Workflow

graph
    Start[Ask a question] --> A
    subgraph "RAG Application Workflow"
        A[Improve Question] --> B[Formulate Query]
        B --> C{Decide to Generate}
        C -- Generate --> D[Retrieve from Database]
        C -- No Generate --> E["Web Search"]
        D --> F[Add Additional Metadata]
        E --> H[Convert to Markdown]
        H --> D
        F --> I{Decide to Rerank}
        I -- Rerank --> J[Rerank Documents]
        I -- No Rerank --> K[Generate]
        J --> K
        K --> L[Ask User for Feedback]
        L -- Web Search --> E
        L -- Happy with Answer --> M[END]
    end

    subgraph "Web Search"
        E --> N{Choose Search Engine}
        N -- Language is Chinese --> P{Baidu}
        N -- Language is not Chinese --> O{Google}
        O --> E
        P --> E
    end

    subgraph "Convert to Markdown"
        H -->  Q[Choose Conversion Tool]
        Q --> R{Firecrawl}
        R -- Fail --> S{Jina.ai}
        S -- Fail --> T{Browser Automation}
        T -- Fail --> V{Gemini}
        R --> H
        S --> H
        T --> H
        V --> H
        T -- Success --> W{Choose Additional Processing}
        W --> X{Jina.ai}
        W --> Y{Local Processing}
        W --> Z{Gemini}
        X --> H
        Y --> H
        Z --> H
    end

    subgraph "Retrieve from Database"
        D --> G[Retrieve using multiple queries]
        G --> D
    end
Demo video: climate-rag-ui.webm

Installation

Please note: this package is still under development. It has only been tested in a couple of settings so far!

We use uv to manage packages.

pip install uv

Create a .env file in the root directory from the provided example, then add all the relevant API keys:

cp .env.example .env

API Keys

Climate RAG reads the following environment variables. Only the Redis and ChromaDB settings are strictly required; the rest enable optional features:

| Variable | Description | Required | Source |
| --- | --- | --- | --- |
| OPENAI_API_KEY | Used for OpenAI embeddings and LLMs (GPT-4, GPT-3.5) | Optional | https://platform.openai.com/account/api-keys |
| TAVILY_API_KEY | Used for Tavily Search | Optional | https://www.tavily.com/ |
| LANGCHAIN_API_KEY | Used for LangChain tracing | Optional | https://docs.langchain.com/docs/integrations/providers/langchain_hub |
| FIRECRAWL_API_KEY | Used for FireCrawlLoader | Optional | https://firecrawl.io/ |
| SERPER_API_KEY | Used for Serper.dev, a Google Search API alternative | Optional | https://serper.dev/ |
| SERPAPI_API_KEY | Used for SerpApi, a Google Search API alternative | Optional | https://serpapi.com/ |
| JINA_API_KEY | Used for Jina AI document reranking and loading PDF files | Optional | https://cloud.jina.ai/ |
| COHERE_API_KEY | Used for Cohere reranking | Optional | https://cohere.ai/ |
| REDIS_HOSTNAME | Hostname of your Redis instance | Required | https://redis.io/ |
| REDIS_PORT | Port of your Redis instance | Required | https://redis.io/ |
| CHROMADB_HOSTNAME | Hostname of your ChromaDB instance | Required | https://docs.trychroma.com/ |
| CHROMADB_PORT | Port of your ChromaDB instance | Required | https://docs.trychroma.com/ |
| STATIC_PATH | Path to the static directory for serving output files | Optional | |
| UPLOAD_FILE_PATH | Path to the directory where uploaded files are stored | Optional | |
| S3_BUCKET | Name of your S3 bucket | Optional | https://aws.amazon.com/s3/ |
| S3_ACCESS_KEY_ID | Access key ID for your S3 bucket | Optional | https://aws.amazon.com/s3/ |
| S3_ACCESS_KEY_SECRET | Secret access key for your S3 bucket | Optional | https://aws.amazon.com/s3/ |
| S3_ENDPOINT_URL | Endpoint URL for your S3-compatible storage service (e.g. MinIO) | Optional | |
| USE_S3 | Whether to use S3 for storing output files (default: False) | Optional | |
| GOOGLE_APPLICATION_CREDENTIALS | Path to your Google Cloud service account credentials file (required for Gemini) | Optional | https://cloud.google.com/docs/authentication/getting-started |
| GOOGLE_PROJECT_ID | ID of your Google Cloud project (required for Gemini) | Optional | https://cloud.google.com/resource-manager/docs/creating-managing-projects |
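
As a minimal illustration, a .env that sets only the required variables might look like this (the hostnames and ports below are placeholders for a local setup; 6379 and 8000 are the usual Redis and ChromaDB defaults, so adjust them to match your instances):

```
REDIS_HOSTNAME=localhost
REDIS_PORT=6379
CHROMADB_HOSTNAME=localhost
CHROMADB_PORT=8000
```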

Usage

Running the RAG Pipeline

First, you will need to start a ChromaDB server:

uv run chroma run --path chroma/

Also ensure that you have a Redis Stack server running. You can check with:

redis-cli ping
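
If Redis Stack is not already running, one convenient option (among others) is to start it with Docker:

docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest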

To run a query against the RAG pipeline, execute the following command, replacing "Give me a list of coal power plants in Vietnam" with your desired query:

uv run query_data.py "Give me a list of coal power plants in Vietnam"

See the Advanced Options section for additional configuration options.

Running the WebApp

To launch the Climate RAG web app, run the following command:

uv run webapp.py

This will launch the app in your web browser. The web app provides a more user-friendly interface for interacting with the RAG pipeline. It features:

Chat Interface

  • Ask questions in natural language.
  • View the AI's responses and the sources used.
  • Download the answers as PDF or DOCX files.

Previous Queries

  • Access a history of your previous questions and answers.
  • Download previous answers as PDF or DOCX files.

Document Management

  • Upload new documents directly to the database.
  • Add URLs for the pipeline to scrape and ingest.
  • Search through existing documents in the database.
  • View details of specific documents, including titles, company names, sources, dates added, and page lengths.

Console

  • View the console logs for troubleshooting and monitoring the pipeline's activity.

Populate the Database

To populate the database with relevant documents, you can provide a list of URLs, upload files directly, or ingest a whole directory of documents.

Adding URLs

To add new URLs to the database, run the following command, replacing https://www.example.com with the desired URL. You can add multiple URLs by separating them with spaces:

uv run populate_database.py --urls "https://www.example.com" "https://www.another-example.com"
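
If you have many URLs to ingest, one approach is to keep them in a text file, one URL per line (urls.txt here is a hypothetical name), and pass them all in a single invocation:

cat urls.txt | xargs uv run populate_database.py --urls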

Uploading Files

To add local files to the database, run the following command, replacing path/to/file1.pdf with the desired file path. You can add multiple files by separating them with spaces:

uv run populate_database.py --files "path/to/file1.pdf" "path/to/file2.md"

Ingesting a Directory

To ingest a directory of documents, put your files in the data/ directory and then run the following command:

uv run populate_database.py

Database Cleanup

To remove documents from the database, you can use the cleanup_database.py script. This script will identify and remove documents that contain certain error messages or have a page length shorter than a predefined threshold. To run the script:

uv run cleanup_database.py

Adding Additional Metadata

The RAG pipeline uses metadata to enhance document retrieval. You can manually add additional metadata to a document or enable automatic metadata extraction.

Automatic Metadata Extraction

To enable automatic metadata extraction, use the --add-additional-metadata flag when running the RAG pipeline. This will use a large language model to extract metadata from the documents in the database.
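
For example, to run a query and extract metadata for the retrieved documents in one go (assuming the flag is passed to query_data.py alongside the other pipeline options):

uv run query_data.py "Give me a list of coal power plants in Vietnam" --add-additional-metadata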

Advanced Options

The RAG pipeline offers various configuration options. You can customize the behavior using the following flags:

  • --rag-filter: Filter the database search to specific sources (e.g., --rag-filter carbontracker.org).
  • --improve-question: Improve the query using a language model before searching (default: True).
  • --rerank: Rerank the retrieved documents using a ranking model (default: True).
  • --crawl: Crawl the web for additional documents (default: True).
  • --language: Select the language for the query (e.g., --language zh for Chinese).
  • --max-search-queries: Set the maximum number of search queries (default: 1).
  • --initial-generation: Generate an answer before web search (default: True).
  • --llm: Select the language model to use (default: claude).

For example, to run a query in Chinese with a maximum of 5 search queries, you would use the following command:

uv run query_data.py "给我一份关于越南煤电厂的清单" --language zh --max-search-queries 5
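
Because every option is exposed on the command line, the pipeline is straightforward to script. The sketch below is not the project's built-in Python API; it simply drives the CLI from Python via subprocess, with illustrative queries and flags:

```python
import subprocess

# Illustrative queries; replace with your own.
queries = [
    "Give me a list of coal power plants in Vietnam",
    "Give me a list of coal power plants in Indonesia",
]

for query in queries:
    # Mirrors the CLI usage shown above; see Advanced Options for all flags.
    subprocess.run(
        ["uv", "run", "query_data.py", query, "--max-search-queries", "2"],
        check=True,  # raise if the pipeline exits with an error
    )
```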


Contributing

We welcome contributions to Climate RAG. If you'd like to contribute:

  1. Fork the repository.
  2. Create a new branch for your changes.
  3. Make your changes and commit them with clear descriptions.
  4. Push your changes to your forked repository.
  5. Submit a pull request to the main repository.
