Version 3 Refactor (#135)
* First pass at refactor

* Removed outdated unit tests

* Adjusted build pipeline

* Bumped version

* Added notes on the caching

* Working through unit tests 1

* Fixed mypy errors

* More progress on unit tests

* Linted

* Fixed remaining unit tests

* More typing stuff

* Added caching info to README

* Changed doc.name to doc.docname

* Updated testing environment to 3.11

* Partially completed conversion to new prompt system

* Fixed more tests

* Fixed last tests

* Fixed missing import

* Completed custom prompts

* Added the fileio reader

* Added url support

* Simplified some tests

* Fixed conditions on test

* Fixed README

* Updated TOC

* Reduced section heading level on TOC

* Revised README more
whitead authored Jun 10, 2023
1 parent 736de35 commit 1f60272
Showing 21 changed files with 1,126 additions and 1,148 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/tests.yml
@@ -13,7 +13,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10"]
        python-version: ["3.11"]

    steps:
      - uses: actions/checkout@v2
@@ -24,8 +24,9 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest build
          if [ -f dev-requirements.txt ]; then pip install -r dev-requirements.txt; fi
      - name: Check pre-commit
        run: pre-commit run --all-files || ( git status --short ; git diff ; exit 1 )
      - name: Install
        run: |
          pip install .
2 changes: 1 addition & 1 deletion .gitignore
@@ -132,4 +132,4 @@ dmypy.json
*.txt.json

*.ipynb
env
env
48 changes: 32 additions & 16 deletions .pre-commit-config.yaml
@@ -1,16 +1,32 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.2.3
    hooks:
      - id: trailing-whitespace
      - id: check-yaml
      - id: end-of-file-fixer
      - id: mixed-line-ending
  - repo: https://github.com/psf/black
    rev: "22.3.0"
    hooks:
      - id: black
  - repo: https://github.com/isort/isort
    rev: "5.11.2"
    hooks:
      - id: isort
default_language_version:
  python: python3
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: check-yaml
      - id: end-of-file-fixer
      - id: mixed-line-ending
      - id: check-added-large-files
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: "v0.0.270"
    hooks:
      - id: ruff
        args: [ --fix, --exit-non-zero-on-fix ]
  - repo: https://github.com/psf/black
    rev: "23.3.0"
    hooks:
      - id: black
        language_version: python3
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: "v1.3.0"
    hooks:
      - id: mypy
        args: [--pretty, --ignore-missing-imports]
        additional_dependencies: [types-requests]
  - repo: https://github.com/PyCQA/isort
    rev: "5.12.0"
    hooks:
      - id: isort
        args: [--profile=black, "--skip=__init__.py", "--filter-files"]
2 changes: 2 additions & 0 deletions .ruff.toml
@@ -0,0 +1,2 @@
# Allow lines to be as long as 120 characters.
line-length = 120
192 changes: 162 additions & 30 deletions README.md
@@ -1,22 +1,54 @@
# Paper QA

- [Paper QA](#paper-qa)
- [Output Example](#output-example)
- [References](#references)
- [Hugging Face Demo](#hugging-face-demo)
- [Install](#install)
- [Usage](#usage)
- [Adding Documents](#adding-documents)
- [Choosing Model](#choosing-model)
- [Adjusting number of sources](#adjusting-number-of-sources)
- [Using Code or HTML](#using-code-or-html)
- [Version 3 Changes](#version-3-changes)
- [New Features](#new-features)
- [Naming](#naming)
- [Breaking Changes](#breaking-changes)
- [Notebooks](#notebooks)
- [Agents (experimental)](#agents-experimental)
- [Where do I get papers?](#where-do-i-get-papers)
- [Zotero](#zotero)
- [Paper Scraper](#paper-scraper)
- [PDF Reading Options](#pdf-reading-options)
- [Typewriter View](#typewriter-view)
- [Caching](#caching)
- [Caching Embeddings](#caching-embeddings)
- [Customizing Prompts](#customizing-prompts)
- [Pre and Post Prompts](#pre-and-post-prompts)
- [FAQ](#faq)
- [How is this different from LlamaIndex?](#how-is-this-different-from-llamaindex)
- [How is this different from LangChain?](#how-is-this-different-from-langchain)
- [Can I use different LLMs?](#can-i-use-different-llms)
- [Where do the documents come from?](#where-do-the-documents-come-from)
- [Can I save or load?](#can-i-save-or-load)


[![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/whitead/paper-qa)
[![tests](https://github.com/whitead/paper-qa/actions/workflows/tests.yml/badge.svg)](https://github.com/whitead/paper-qa)
[![PyPI version](https://badge.fury.io/py/paper-qa.svg)](https://badge.fury.io/py/paper-qa)

This is a minimal package for doing question and answering from
PDFs or text files (which can be raw HTML). It strives to give very good answers, with no hallucinations, by grounding responses with in-text citations.

By default, it uses [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings) with a vector DB called [FAISS](https://github.com/facebookresearch/faiss) to embed and search documents. However, via [langchain](https://github.com/hwchase17/langchain) you can use open-source models or embeddings (see details below).

PaperQA uses the process shown below:

<img src="https://user-images.githubusercontent.com/908389/230854097-8fa96768-c694-45c0-bb04-3a7386facef3.jpeg" width="600" alt="Process of vector search, refinement, and answer with context">
1. embed docs into vectors
2. embed query into vector
3. search for top k passages in docs
4. create summary of each passage relevant to query
5. put summaries into prompt
6. generate answer with prompt
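
In miniature, steps 1–3 amount to a nearest-neighbor search over passage vectors. A self-contained toy sketch (a hashed bag-of-words standing in for real embeddings, and plain cosine similarity standing in for FAISS):

```python
import math
import zlib

def embed(text, dim=64):
    # Toy deterministic "embedding": hashed bag-of-words, L2-normalized.
    # A stand-in for real embeddings (e.g. OpenAI's), just to show the flow.
    v = [0.0] * dim
    for word in text.lower().split():
        v[zlib.crc32(word.strip(".,?!").encode()) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def top_k(query, passages, k=2):
    # Steps 1-3: embed passages and query, rank passages by cosine similarity.
    qv = embed(query)
    return sorted(passages, key=lambda p: -sum(a * b for a, b in zip(qv, embed(p))))[:k]

passages = [
    "Carbon nanotubes can be assembled and sorted after synthesis.",
    "IGZO thin-film transistors enable flexible macroelectronics.",
    "Spring weather in Paris is generally mild.",
]
print(top_k("How are carbon nanotubes assembled?", passages, k=1))
```

In the real pipeline, FAISS indexes the vectors, and an LLM handles steps 4–6 (summarizing each retrieved passage and composing the final answer).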

## Output Example

@@ -32,7 +64,6 @@ Tulevski2007: Tulevski, George S., et al. "Chemically assisted directed assembly

Chen2014: Chen, Haitian, et al. "Large-scale complementary macroelectronics using hybrid integration of carbon nanotubes and IGZO thin-film transistors." Nature communications 5.1 (2014): 4097.


## Hugging Face Demo

[Hugging Face Demo](https://huggingface.co/spaces/whitead/paper-qa)
@@ -67,6 +98,10 @@ print(answer.formatted_answer)

The answer object has the following attributes: `formatted_answer`, `answer` (answer alone), `question`, `context` (the summaries of passages found for the answer), `references` (the docs from which the passages came), and `passages`, which contains the raw text of the passages as a dictionary.
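
A rough sketch of that shape (a hypothetical stand-in with illustrative field types, not the actual paperqa class):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AnswerSketch:
    """Illustrative stand-in for the answer object, with fields per the README."""
    question: str
    answer: str = ""            # answer alone
    formatted_answer: str = ""  # answer plus citations/references
    context: str = ""           # summaries of passages found for the answer
    references: str = ""        # docs from which the passages came
    passages: Dict[str, str] = field(default_factory=dict)  # raw passage text by key

a = AnswerSketch(question="What is a carbon nanotube?")
a.answer = "A cylindrical nanostructure of carbon atoms."
print(a.question)
```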

### Adding Documents

`add` will add from paths. You can also use `add_file` (expects a file object) or `add_url` to work with other sources.

### Choosing Model

By default, it uses a hybrid of `gpt-3.5-turbo` and `gpt-4`. If you don't have gpt-4 access or would like to save money, you can adjust:
@@ -78,9 +113,9 @@ docs = Docs(llm='gpt-3.5-turbo')
or you can use any other model available in [langchain](https://github.com/hwchase17/langchain):

```py
from langchain.llms import Anthropic, OpenAIChat
model = OpenAIChat(model='gpt-4')
summary_model = Anthropic(model="claude-instant-v1-100k", anthropic_api_key="my-api-key")
from langchain.chat_models import ChatAnthropic, ChatOpenAI
model = ChatOpenAI(model='gpt-4')
summary_model = ChatAnthropic(model="claude-instant-v1-100k", anthropic_api_key="my-api-key")
docs = Docs(llm=model, summary_llm=summary_model)
```

@@ -147,6 +182,55 @@ answer = docs.query("Where is the search bar in the header defined?")
print(answer)
```

## Version 3 Changes

Version 3 includes many changes to add type annotations, make the code more focused/modular, and improve performance with very large numbers of documents. The major breaking changes are documented below:


### New Features

The following new features are in v3:

1. `add_url` and `add_file` are now supported for adding from URLs and file objects
2. Prompts can be customized, and now can be executed pre and post query
3. Consistent use of `dockey` and `docname` for unique identifiers and natural-language names enables better tracking with external databases
4. Texts and embeddings are no longer required to be part of `Docs` object, so you can use external databases or other strategies to manage them
5. Various simplifications, bug fixes, and performance improvements

### Naming

The following table shows the old names and the new names:

| Old Name | New Name | Explanation |
| :--- | :---: | ---: |
| `key` | `name` | Name is a natural language name for text. |
| `dockey` | `docname` | Docname is a natural language name for a document. |
| `hash` | `dockey` | Dockey is a unique identifier for the document. |


### Breaking Changes


#### Pickled objects

The pickled objects are not compatible with the new version.

#### Agents

The agent functionality has been removed, as it's not a core focus of the library.

#### Caching

Caching has been removed because it's not a core focus of the library. See FAQ below for how to use caching.

#### Answers

Answers will not include passages, but instead return dockeys that can be used to retrieve the passages. Tokens/cost will also not be counted since that is built into langchain by default (see below for an example).
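
One way to recover passage text under the new scheme (a hedged sketch with made-up names, not the paperqa API): keep your own dockey-to-passage mapping outside the library and resolve the keys an answer returns.

```python
# Hypothetical external passage store keyed by dockey (not part of paperqa).
passage_store = {
    "4f2a": "Carbon nanotubes can be sorted by chirality using DNA wrapping.",
    "9c1e": "IGZO thin-film transistors offer high electron mobility.",
}

def resolve_passages(dockeys, store):
    # Look up the raw passage text for each dockey an answer refers to,
    # silently skipping any keys the store no longer holds.
    return [store[k] for k in dockeys if k in store]

print(resolve_passages(["9c1e", "gone"], passage_store))
```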

#### Search Query

The search query chain has been removed. You can use langchain directly to do this.

## Notebooks

If you want to use this in a Jupyter notebook or Colab, you need to run the following command:
@@ -251,6 +335,70 @@ answer = docs.query("What manufacturing challenges are unique to bispecific antibodies?")
print(answer)
```

## PDF Reading Options

By default [PyPDF](https://pypi.org/project/pypdf/) is used since it's pure Python and easy to install. For faster PDF reading, paper-qa will detect and use [PyMuPDF (fitz)](https://pymupdf.readthedocs.io/en/latest/):

```sh
pip install pymupdf
```

## Typewriter View

To stream the completions as they occur (giving that ChatGPT typewriter look), you can simply instantiate models with those properties:

```python
from paperqa import Docs
from langchain.chat_models import ChatOpenAI

my_llm = ChatOpenAI(model='gpt-3.5-turbo', streaming=True)
docs = Docs(llm=my_llm)
```

## Caching

You can use the built-in langchain caching capabilities. Just run this code at the top of your script:

```py
import langchain
from langchain.cache import InMemoryCache

langchain.llm_cache = InMemoryCache()
```

### Caching Embeddings

In general, embeddings are cached when you pickle a `Docs` object, regardless of which vector store you use. If you would like to manage cached embeddings via an external database or another strategy,
you can populate a `Docs` object directly via
the `add_texts` method. It takes chunked texts and documents, which are serializable objects, to populate `Docs`.

You can also simply use a separate vector database by setting `doc_index` and `texts_index` explicitly when building the `Docs` object.
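
A generic external-cache strategy (an illustrative sketch, not paperqa's implementation): key each chunk's embedding by a hash of its text, so unchanged chunks are never re-embedded on later runs.

```python
import hashlib

embedding_cache = {}  # persist however you like: shelve, sqlite, redis, ...

def fake_embed(text):
    # Stand-in for a real (and billable) embedding call, e.g. to OpenAI.
    return [float(b) for b in hashlib.sha256(text.encode()).digest()[:4]]

def cached_embed(text, cache=embedding_cache):
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = fake_embed(text)  # only computed on a cache miss
    return cache[key]

v1 = cached_embed("carbon nanotube synthesis")
v2 = cached_embed("carbon nanotube synthesis")  # cache hit, identical vector
print(v1 == v2)
```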

## Customizing Prompts

You can customize any of the prompts, using the `PromptCollection` class. For example, if you want to change the prompt for the question, you can do:

```python
from paperqa import Docs, Answer, PromptCollection
from langchain.prompts import PromptTemplate

my_qaprompt = PromptTemplate(
input_variables=["context", "question"],
template="Answer the question '{question}' "
"Use the context below if helpful. "
"You can cite the context using the key "
"like (Example2012). "
"If there is insufficient context, write a poem "
"about how you cannot answer.\n\n"
"Context: {context}\n\n")
prompts=PromptCollection(qa=my_qaprompt)
docs = Docs(prompts=prompts)
```

### Pre and Post Prompts

Following the syntax above, you can also include prompts that
are executed before and after the query. For example, you can use a post prompt to critique the answer.


## FAQ

### How is this different from LlamaIndex?
@@ -261,10 +409,6 @@

It's not that different! This is similar to the tree response method in LlamaIndex.

### How is this different from LangChain?

It's not! We use langchain to abstract the LLMs, and the process is very similar to the `map_reduce` chain in LangChain.

### Can I use different LLMs?

Yes, you can use any LLMs from [langchain](https://langchain.readthedocs.io/) by passing the `llm` argument to the `Docs` class. You can use different LLMs for summarization and for question answering too.
@@ -288,15 +432,3 @@ with open("my_docs.pkl", "wb") as f:
with open("my_docs.pkl", "rb") as f:
docs = pickle.load(f)
```

5 changes: 3 additions & 2 deletions dev-requirements.txt
@@ -1,7 +1,8 @@
pytest
pre-commit
requests
paper-scraper@git+https://github.com/blackadad/paper-scraper.git
pyzotero
python-dotenv
pymupdf
pymupdf
build
types-requests
5 changes: 3 additions & 2 deletions paperqa/__init__.py
@@ -1,3 +1,4 @@
from .agent import run_agent
from .docs import Answer, Docs, maybe_is_text
from .docs import Answer, Docs, PromptCollection
from .version import __version__

__all__ = ["Docs", "Answer", "PromptCollection", "__version__"]