# CurateGPT
CurateGPT is a prototype web application and framework for performing general purpose AI-guided curation and curation-related operations over collections of objects.
See also the app on curategpt.io (note: this is sometimes down, and may only have a subset of the functionality of the local app)
## Getting Started

### User Installation
CurateGPT is available on PyPI and can be installed with `pip`:
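Assuming the package name matches the repository name (`curategpt`), the installation command is:

```shell
pip install curategpt
```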
### Developer Installation
You will first need to install Poetry.
Then clone this repo:
```shell
git clone https://github.com/monarch-initiative/curategpt.git
cd curategpt
```
Install the dependencies:
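Since the project uses Poetry, this is presumably the standard:

```shell
poetry install
```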
### API Keys
To get the best performance from CurateGPT, we recommend getting an OpenAI API key and setting it:
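For example (the environment variable name is the standard one read by the OpenAI client libraries):

```shell
export OPENAI_API_KEY=<your key>
```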
(for members of Monarch: ask on Slack if you would like to use the group key)
CurateGPT will also work with other large language models - see "Selecting models" below.
### Loading Example Data and Running the App
You initially start with an empty database. You can load whatever you like into this database! Any JSON, YAML, or CSV is accepted.
CurateGPT comes with wrappers for some existing local and remote sources, including ontologies. The Makefile contains some examples of how to load these. You can load any ontology using the `ont-<name>` target; this loads the ontology (via OAK) into a correspondingly named collection, e.g. CL into `ont_cl`.
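For example, to load the Cell Ontology (the exact target name is an assumption, following the `ont-<name>` pattern):

```shell
make ont-cl
```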
Note that by default this loads into a collection set stored at `stagedb`, whereas the app works off of `db`. You can copy the collection set to the db with:

```shell
cp -r stagedb/* db/
```
Run the Streamlit app with:
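Assuming the Makefile provides an `app` target for this (an assumption; a direct `streamlit run` invocation would also work):

```shell
make app
```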
## Building Indexes
CurateGPT depends on vector database indexes of the databases/ontologies you want to curate.
The flagship application is ontology curation, so to build an index for an OBO ontology like CL:
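A sketch of such an invocation, assuming an `ontology index` subcommand and OAK's `sqlite:obo:` adapter syntax (the exact flags are illustrative):

```shell
curategpt ontology index -p stagedb -c ont_cl -m openai: sqlite:obo:cl
```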
This requires an OpenAI key.
(You can build indexes using an open embedding model by modifying the command to leave off the `-m` option, but this is not recommended, as currently the OpenAI embeddings seem to work best.)

To load the default ontologies:
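Presumably via the Makefile's default target (an assumption):

```shell
make all
```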
(this may take some time)
To load different databases:
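The Makefile likely follows a `load-db-<name>` pattern for these (the target name below is an assumption):

```shell
make load-db-hpoa
```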
You can load an arbitrary JSON, YAML, or CSV file:
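A hypothetical invocation (the collection and file names are placeholders, and the bare `index` subcommand is assumed from the CLI's other indexing commands):

```shell
curategpt index -c my_collection my_data.yaml
```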
(you will need to do this in the poetry shell)
To load a GitHub repo of issues:
```shell
curategpt -v view index -c gh_uberon -m openai: --view github --init-with "{repo: obophenotype/uberon}"
```
The following are also supported:
## Notebooks
See notebooks for examples.
## Selecting Models
Currently this tool works best with the OpenAI gpt-4 model (for instruction tasks) and OpenAI `text-embedding-ada-002` for embeddings.

CurateGPT is layered on top of simonw/llm, which has a plugin architecture for using alternative models. In theory you can use any of these plugins.
Additionally, you can set up an OpenAI-emulating proxy using litellm.
The `litellm` proxy may be installed with `pip`:

```shell
pip install 'litellm[proxy]'
```

Let's say you want to run Mixtral locally using Ollama. Start up Ollama (you may have to run `ollama serve` first), then start up the litellm proxy.
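Assuming the model is available in Ollama under the name `mixtral` and litellm's Ollama integration is used, these two steps might look like:

```shell
ollama run mixtral
litellm -m ollama/mixtral
```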
Next, edit your `extra-openai-models.yaml` as detailed in the llm docs. You can now use this:
```shell
curategpt ask -m litellm-mixtral -c ont_cl "What neurotransmitter is released by the hippocampus?"
```
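For reference, the `extra-openai-models.yaml` entry for the above might look like this (a sketch; the alias and API base are assumptions matching the litellm proxy's default port):

```yaml
- model_id: litellm-mixtral
  model_name: ollama/mixtral
  api_base: "http://0.0.0.0:8000"
```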
But be warned that many of the prompts in CurateGPT were engineered against OpenAI models, and they may give suboptimal results or fail entirely with other models. As an example, `ask` seems to work quite well with Mixtral, but `complete` works poorly. We haven't yet investigated whether the issue is the model, our prompts, or the overall approach.

Welcome to the world of AI engineering!
## Using the Command Line

Run `curategpt --help` to see various commands for working with indexes, searching, extracting, generating, etc.
These functions are generally available through the UI, and the current priority is documenting these.
### Chatting with a Knowledge Base
```shell
curategpt ask -c ont_cl "What neurotransmitter is released by the hippocampus?"
```
may yield something like:
### Chatting with PubMed

```shell
curategpt view ask -V pubmed "what neurons express VIP?"
```
### Chatting with a GitHub Issue Tracker

```shell
curategpt ask -c gh_obi "what are some new term requests for electrophysiology terms?"
```
### Term Autocompletion (DRAGON-AI)
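A hypothetical invocation using the `complete` subcommand mentioned above (the example term is illustrative):

```shell
curategpt complete -c ont_cl "cerebellar granule cell"
```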
yields:
### All-by-All Comparisons
You can compare all objects in one collection:
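A sketch of such an invocation (the subcommand name and flags are assumptions; the collection follows the naming used above):

```shell
curategpt all-by-all --threshold 0.80 -c ont_cl
```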
This takes 1-2s, as it involves comparison over pre-computed vectors. It reports top hits above a threshold.
Results may vary. You may want to try different texts for embeddings (the default is the entire JSON object; for ontologies it is concatenation of labels, definition, aliases).
Sample:
Note that CurateGPT has a separate component for using an LLM to evaluate candidate matches (see also https://arxiv.org/abs/2310.03666); this is not enabled by default, as it would be expensive to run for a whole ontology.