This project is designed to extract and analyze metadata from Gene Expression Omnibus (GEO) studies, focusing on human (Homo sapiens) samples. It uses a combination of database operations, natural language processing, and machine learning techniques to process and interpret the metadata.
Database Management
- Uses DuckDB to store and query GEO metadata
- Creates and manages tables for both raw metadata and parsed results
Metadata Extraction
- Extracts relevant information from XML files
- Focuses on human studies by filtering for "Homo sapiens" organisms
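The organism filter can be sketched with the standard library's `xml.etree.ElementTree`. The XML layout below (`Studies`, `Series`, `Title`, `Organism`, no namespace) is a simplified, hypothetical stand-in; real GEO XML files use their own schema and namespaces, so element lookups would need adjusting.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML; real GEO files are structured differently.
SAMPLE_XML = """
<Studies>
  <Series id="GSE0000001">
    <Title>Example human study</Title>
    <Organism>Homo sapiens</Organism>
  </Series>
  <Series id="GSE0000002">
    <Title>Example mouse study</Title>
    <Organism>Mus musculus</Organism>
  </Series>
</Studies>
"""

def extract_human_series(xml_text):
    """Return one dict per series that lists "Homo sapiens" as an organism."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for series in root.iter("Series"):
        organisms = [o.text.strip() for o in series.iter("Organism") if o.text]
        if "Homo sapiens" in organisms:
            records.append({
                "gse_id": series.get("id"),
                "title": (series.findtext("Title") or "").strip(),
                "organisms": organisms,
            })
    return records

human_records = extract_human_series(SAMPLE_XML)
```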
Text Processing
- Generates text descriptions for each study based on various metadata fields
- Includes information such as title, summary, overall design, treatment protocols, and more
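One plausible way to assemble such a description is to join the non-empty metadata fields under short labels. The field keys and labels below are assumptions for illustration; the project may use different names and more fields.

```python
# Hypothetical mapping from metadata keys to the labels used in the text.
FIELD_LABELS = {
    "title": "Title",
    "summary": "Summary",
    "overall_design": "Overall design",
    "treatment_protocol": "Treatment protocol",
}

def build_description(record):
    """Join the non-empty fields of a study record into one text block."""
    parts = []
    for key, label in FIELD_LABELS.items():
        value = (record.get(key) or "").strip()
        if value:  # skip fields that are missing or blank
            parts.append(f"{label}: {value}")
    return "\n".join(parts)

description = build_description({
    "title": "Example study",
    "summary": "",
    "overall_design": "Two groups",
})
```

Skipping empty fields keeps the text handed to a language model short and avoids prompting it with blank labels.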
LLM-based Information Extraction
- Uses Large Language Models (LLMs) for advanced information extraction
- Supports multiple LLM providers including Groq and Azure OpenAI
- Extracts specific fields like high-level indication, drug exposure, modalities, etc.
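The extraction step can be pictured as building a prompt, sending it to a provider, and parsing a JSON reply. The sketch below replaces the provider call with a stub function so it runs offline; the key names, prompt wording, and stub reply are assumptions, not the project's actual prompts or output.

```python
import json

# Hypothetical key names for the fields mentioned above.
EXPECTED_FIELDS = ("high_level_indication", "drug_exposure", "modalities")

def build_prompt(description):
    """Ask for a single JSON object containing the expected keys."""
    field_list = ", ".join(EXPECTED_FIELDS)
    return (
        "Extract the following fields from the study description and "
        f"reply with a single JSON object with keys: {field_list}.\n\n"
        f"Study description:\n{description}"
    )

def parse_response(raw):
    """Parse the reply; malformed or missing values become None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {field: data.get(field) for field in EXPECTED_FIELDS}

def fake_llm(prompt):
    # Stub standing in for a real provider call (e.g. Groq or Azure OpenAI).
    return '{"high_level_indication": "oncology", "modalities": ["RNA-seq"]}'

result = parse_response(fake_llm(build_prompt("Title: Example study")))
```

Normalizing every reply to the same set of keys means downstream code can store results without checking which fields the model happened to return.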
Langfuse Integration
- Incorporates Langfuse for prompt management and observability
Extensible Architecture
- The Extractor class serves as a base for creating specialized extractors
- GSEmetaExtractor is an example of a specialized extractor for GEO metadata
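The base-class pattern might look roughly like the sketch below. The method names (`build_prompt`, `parse`, `extract`), the constructor argument, and the `OrganismExtractor` subclass are all hypothetical; the project's real `Extractor` interface may differ.

```python
from abc import ABC, abstractmethod

class Extractor(ABC):
    """Hypothetical base class: subclasses define the prompt and the parsing."""

    def __init__(self, llm):
        # llm is any callable that maps a prompt string to a response string.
        self.llm = llm

    @abstractmethod
    def build_prompt(self, record):
        """Turn one metadata record into a prompt string."""

    @abstractmethod
    def parse(self, response):
        """Turn the raw model response into a dict of extracted fields."""

    def extract(self, record):
        # Shared pipeline: prompt -> model -> parsed result.
        return self.parse(self.llm(self.build_prompt(record)))

class OrganismExtractor(Extractor):
    """Illustrative subclass, not part of the project."""

    def build_prompt(self, record):
        return f"Which organism is studied? {record['title']}"

    def parse(self, response):
        return {"organism": response.strip()}

# A lambda stands in for a real model call so the example runs offline.
extractor = OrganismExtractor(llm=lambda prompt: " Homo sapiens \n")
output = extractor.extract({"title": "Example study"})
```

A new extractor then only needs to supply its own prompt and parsing logic, while the call sequence stays in one place.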
- Processes GEO studies in batches
- Saves intermediate results to avoid redundant processing
- Provides progress tracking and error handling
- Allows for easy extension to extract different types of information
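Batching with saved intermediate results could be sketched as follows. The function name, the JSON layout, and the choice to retry only studies without an "ok" status are assumptions for illustration, not the project's actual behavior.

```python
import json
import tempfile
from pathlib import Path

def process_in_batches(study_ids, process_one, results_path, batch_size=2):
    """Process studies in batches, saving results to JSON after each batch."""
    results_path = Path(results_path)
    # Reload earlier results so finished studies are not processed again.
    results = json.loads(results_path.read_text()) if results_path.exists() else {}
    pending = [
        sid for sid in study_ids
        if results.get(sid, {}).get("status") != "ok"
    ]
    for start in range(0, len(pending), batch_size):
        for sid in pending[start:start + batch_size]:
            try:
                results[sid] = {"status": "ok", "value": process_one(sid)}
            except Exception as exc:  # record the failure and keep going
                results[sid] = {"status": "error", "error": str(exc)}
        results_path.write_text(json.dumps(results, indent=2))
        done = min(start + batch_size, len(pending))
        print(f"processed {done}/{len(pending)}")
    return results

calls = []

def demo_process(sid):
    # Placeholder for the real per-study work (e.g. an LLM call).
    calls.append(sid)
    return sid.lower()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.json"
    first = process_in_batches(["GSE1", "GSE2", "GSE3"], demo_process, path)
    # The second run finds every study already saved and does no new work.
    second = process_in_batches(["GSE1", "GSE2", "GSE3"], demo_process, path)
```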
- Ensure the necessary environment variables are set (e.g., API keys for LLM providers)
- Run the metadata extraction script to populate the DuckDB database
- Use the GSEmetaExtractor or create custom extractors to process the metadata
- Analyze the extracted information stored in the database and JSON files
- Python 3.x
- DuckDB
- Langfuse
- OpenAI or Groq API access
This project uses environment variables for configuration. Create a .env file in the root directory of the project and add the following variables:
- GROQ_API_KEY: API key for Groq
- LANGFUSE_SECRET_KEY: Secret key for Langfuse
- LANGFUSE_PUBLIC_KEY: Public key for Langfuse
- LANGFUSE_HOST: Host URL for Langfuse
- AZURE_OPENAI_ENDPOINT: Endpoint URL for Azure OpenAI
- AZURE_OPENAI_API_KEY: API key for Azure OpenAI
- OPENAI_API_VERSION: Version of the OpenAI API to use
Ensure all necessary API keys and endpoints are properly set in the .env file before running the project. The actual values for these variables should be kept confidential and never committed to version control.
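A small pre-flight check can catch missing settings before any API call is made. This sketch assumes the variables have already been loaded into the process environment (for example by a .env loader) and treats all seven as required; when only one LLM provider is used, the list can be trimmed accordingly.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = [
    "GROQ_API_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_HOST",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
]

def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Demo with an explicit mapping so the real environment is not touched.
demo_missing = missing_env_vars({"GROQ_API_KEY": "placeholder"})
```

Calling `missing_env_vars()` with no argument checks the real environment; the check reports only variable names, never their values, so its output is safe to log.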