This is the code repo for the paper "SEED: Domain-Specific Data Curation With Large Language Models".
SEED is an approach that leverages Large Language Models (LLMs) to automatically generate domain-specific data curation solutions. Given a description of the task, the input data, and the expected output, the SEED compiler produces an executable pipeline consisting of LLM-generated code, small models, and data access modules.
Current Version: v0.2.0
Compatibility: Tested on MacBook M1 Pro, macOS 12.6.2, Python 3.9, PyTorch 1.13.1
Hardware Requirements: None
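To compare a local setup against the tested configuration listed above, a small check along the following lines may help. This is only a sketch and not part of SEED: the function name `report_environment` is hypothetical, and it relies solely on the Python standard library to read the installed `torch` version, if any.

```python
import platform
from importlib import metadata


def report_environment():
    """Collect the interpreter, OS, and PyTorch versions of the local setup."""
    try:
        # Reads the installed distribution's version without importing torch.
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None  # PyTorch is not installed in this environment.
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "torch": torch_version,
    }


if __name__ == "__main__":
    for name, value in report_environment().items():
        print(f"{name}: {value}")
```

Versions that differ from the tested ones are not necessarily a problem; the README only states what was tested.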
First install the prerequisites and the SEED package.
git clone git@github.com:Magolor/SEED.git
cd ./SEED/SeeD/
pip install -r requirements.txt
pip install -e .
cd ..
The most basic configuration is the authentication key for the OpenAI API. You can create an openai_api.json file to store it:
{
"model": "gpt-4-turbo-preview",
"api_key": "<YOUR_API_KEY>",
"organization": "<YOUR_ORGANIZATION>"
}
Otherwise, you will be prompted to enter these values in the terminal during SEED setup.
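As a minimal sketch, the openai_api.json file shown above could also be generated from Python rather than written by hand. The helper name `write_openai_config` is hypothetical and not part of SEED, and writing the file to the current working directory is an assumption; only the three keys and the model name come from the example above.

```python
import json
from pathlib import Path


def write_openai_config(
    path="openai_api.json",  # assumed location: current working directory
    api_key="<YOUR_API_KEY>",
    organization="<YOUR_ORGANIZATION>",
    model="gpt-4-turbo-preview",
):
    """Write the OpenAI settings to a JSON file and return them as a dict."""
    config = {
        "model": model,
        "api_key": api_key,
        "organization": organization,
    }
    # indent=4 keeps the file easy to read and edit by hand.
    Path(path).write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    # Replace the placeholders with real credentials before running the installer.
    write_openai_config()
```

Because the file holds a secret key, it is usually worth keeping it out of version control.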
Then run the installer to initialize configurations:
python post_install.py
- Recommended: A full tutorial for understanding how SEED works in general: amazon_google_full_tutorial.
- A short version of the same Amazon-Google project: amazon_google_tutorial.
- A code generation agent tutorial: restaurant_tutorial.
- Others:
- pubmed_tutorial
- ...
SEED is currently under development; many more features and optimizations are coming!
- Improve code generation.
- Add Seq2Seq model support from legacy.
- Add tools agent from legacy.
- Improve RAG.
- Improve hyperparameter search.
- Integrate hyperparameter search.
- Support multiple outputs.
- ...