feat(api): integrate semantic search into /search_books endpoint #20
Merged
Owner ddayto21 commented on Feb 10, 2025
- Created search index for book corpus designed for semantic search.
- Implemented book pipeline to extract metadata for book corpus.
- Added text normalization and book metadata preprocessing.
- Updated `_basic_nlp_cleanup()` to remove stopwords in addition to lowercasing and trimming input.
- Modified `_refine_query()` to detect and filter out extra debugging text (e.g. "Detected Place(s):", "Original Query:") from generated output, falling back to the cleaned original query plus any extracted keywords when the generated result is invalid.
- These changes ensure that the query "The Cat and The Hat" is correctly normalized to "cat hat", satisfying the expected output in the integration and unit tests, and improve the robustness of the query-processing pipeline.
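The cleanup behavior described above can be sketched roughly as follows. This is a minimal stand-in, not the PR's actual implementation: the function name mirrors `_basic_nlp_cleanup()`, but the stopword list here is a small illustrative set rather than whatever the real pipeline uses.

```python
import re

# Illustrative stopword list; the real pipeline likely uses a fuller set
# (e.g. from spaCy or NLTK).
STOPWORDS = {"the", "and", "a", "an", "of", "in", "on", "to"}

def basic_nlp_cleanup(query: str) -> str:
    """Lowercase, trim, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(basic_nlp_cleanup("The Cat and The Hat"))  # cat hat
```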
- Created a new `LLMWorker` class in `app/services/llm_worker.py` to handle CPU-bound tasks via `ThreadPoolExecutor`.
- Updated `LLMClient` (`app/clients/llm_client.py`) to use `LLMWorker` for pipeline inference calls, preventing the main event loop from blocking.
- Added unit tests in `app/tests/services/test_llm_worker.py` to verify concurrency, exception handling, and proper thread-pool shutdown.
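The offloading pattern this commit describes can be sketched as below. This is an assumption about the shape of `LLMWorker`, not the actual class from `app/services/llm_worker.py` (which this PR later removes): the core idea is `loop.run_in_executor` over a dedicated `ThreadPoolExecutor` so blocking pipeline calls don't stall the event loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class LLMWorker:
    """Runs CPU-bound callables in a thread pool so the asyncio
    event loop stays responsive while inference executes."""

    def __init__(self, max_workers: int = 2):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    async def run(self, fn, *args):
        # Hand the blocking call to the pool and await its result.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, fn, *args)

    def shutdown(self):
        # Wait for in-flight tasks before releasing the pool.
        self._executor.shutdown(wait=True)

async def main():
    worker = LLMWorker()
    try:
        # Stand-in for a CPU-bound pipeline call.
        result = await worker.run(sum, [1, 2, 3])
        print(result)  # 6
    finally:
        worker.shutdown()

asyncio.run(main())
```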
- Introduced a new test suite in `app/tests/clients/test_llm_client.py` that exercises the `LLMClient` with real pipeline calls (no mocks).
- Verifies end-to-end functionality for NLP cleanup, entity extraction, zero-shot classification, and text generation.
… keyword extraction using the Ollama API.
- Implemented `extract_keywords(query)`, an async function that sends user queries to the model.
- Configured a system prompt to ensure extracted keywords are returned in comma-separated format.
- Used `AsyncClient` from Ollama to manage API interactions.
This module will be used to extract relevant keywords from user queries efficiently.
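A sketch of how `extract_keywords(query)` might look, under the assumption that it uses the `ollama` Python package's `AsyncClient.chat` API against a locally served `llama3.2` model. The system prompt wording and the `parse_keywords` helper are illustrative, not taken from the PR; only the parsing step is testable without a running Ollama server.

```python
# Hypothetical system prompt; the PR's actual prompt text is not shown.
SYSTEM_PROMPT = (
    "Extract the key search terms from the user's query and return them "
    "as a single comma-separated line, with no other text."
)

def parse_keywords(raw: str) -> list[str]:
    """Split the model's comma-separated reply into clean keywords."""
    return [k.strip().lower() for k in raw.split(",") if k.strip()]

async def extract_keywords(query: str, model: str = "llama3.2") -> list[str]:
    # Imported lazily so the parsing helper is usable without the
    # ollama package or a running server.
    from ollama import AsyncClient

    response = await AsyncClient().chat(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    return parse_keywords(response["message"]["content"])
```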
- Simplified `LLMClient` by removing unused worker logic and added conditional resource cleanup for the httpx `AsyncClient`.
- Updated `test_llm_client.py` with enhanced fuzzy matching for keyword extraction, ensuring tests reflect real LLM responses.
- Deleted `app/services/llm_worker.py` and its tests, as they are no longer needed.
- Updated `app/api/routes.py` to use the simplified `LLMClient` directly, without offloading to a worker.
- Updated dependency versions and lockfile to reflect recent changes and improvements.
- Replaced per-request `AsyncClient` instantiation with a single, reusable client for connection pooling.
- Removed redundant argument handling in `fetch_data` to simplify the code path.
- Reduced overhead and improved response latency by streamlining error handling.
- Changed the fixture scope from module to function to avoid "Event loop is closed" errors.
- Each test now creates its own `OpenLibraryAPI` client instance, ensuring proper cleanup.
- Improved the reliability of asynchronous tests.
…tion tests
- Replaced deprecated `on_event` startup/shutdown decorators in `app/main.py` with a lifespan context manager.
- Updated integration tests in `app/tests/integration/api/test_main_route.py` to use `asgi_lifespan` for proper lifecycle handling.
- Ensured that `app.state` is correctly initialized during tests.
…ch tests
- Modified `app/api/routes.py` to work with the updated `OpenLibraryAPI` client.
- Added `app/clients/__init__.py` for package initialization.
- Updated `app/clients/open_library_api_client.py` to support subject-based queries via the search endpoint.
- Added `app/tests/clients/test_search_subjects.py` to validate the new subject-search functionality.
- Updated `pyproject.toml` and `poetry.lock` to include `asgi-lifespan` for managing FastAPI lifespan events.
- Modified `pytest.ini` to configure the async fixture loop scope for proper lifespan handling.
- Set `asyncio_default_fixture_loop_scope` to "function" in `pytest.ini` to address deprecation warnings.
- Verified that all unit and integration tests pass with `asgi-lifespan` and the updated event handling.
… build error
- Updated the Dockerfile to install `build-essential` and `gcc`.
- This enables compilation of C extensions (e.g. blis) on the slim Python image.
- Removed manual `on_startup`/`on_shutdown` calls in test fixtures.
- Added `LifespanManager` to handle FastAPI's lifespan context.
- Ensured each test uses an `ASGITransport`-based `AsyncClient` for in-process requests.
Added a new `ollama` service to `docker-compose.yml`. This service spins up a container that serves a Large Language Model (LLM) using the `ollama/ollama` image, configured to:
- Pull the latest `ollama/ollama` image.
- Download the `llama3.2` model and serve it on port `11435`.
- Use a named volume (`ollama`) to persist model files and avoid re-downloading.
- Restart automatically unless explicitly stopped.
The existing `api` service now depends on `ollama` for LLM functionality, enabling the application to use LLM capabilities within the Docker environment.
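A compose fragment matching that description might look like the following. This is a hedged reconstruction, not the PR's actual `docker-compose.yml`: the host port `11435` and model name come from the commit message, while the `11435:11434` mapping assumes Ollama's default in-container port of 11434, and the `api` build context is a placeholder.

```yaml
services:
  api:
    build: .            # placeholder build context
    depends_on:
      - ollama
  ollama:
    image: ollama/ollama
    ports:
      - "11435:11434"   # host 11435 -> Ollama's default port inside the container
    volumes:
      - ollama:/root/.ollama   # persist downloaded models across restarts
    restart: unless-stopped

volumes:
  ollama:
```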
…etween search query and book embeddings
Implemented a script that calculates cosine similarity scores between a given search query and a set of precomputed book embeddings. The script loads book metadata and embeddings from JSON files, processes the search query, and displays the top N related books by similarity score.
- Added `load_book_embeddings` and `load_books_metadata` functions to load book metadata and embeddings from JSON files.
- Implemented `calculate_similarity_scores` to compute cosine similarities between a search query and book embeddings.
- Added `get_top_k_books` to retrieve the top N related books based on similarity scores.
- Configured logging for detailed output and initialized the SentenceTransformer model globally.
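The scoring and ranking steps can be sketched with NumPy as below. The function names mirror `calculate_similarity_scores` and `get_top_k_books` from the commit, but the signatures are assumptions; the real script encodes the query with SentenceTransformer first, which is omitted here.

```python
import numpy as np

def calculate_similarity_scores(query_vec: np.ndarray,
                                book_embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each embedding row."""
    q = query_vec / np.linalg.norm(query_vec)
    b = book_embeddings / np.linalg.norm(book_embeddings, axis=1, keepdims=True)
    return b @ q

def get_top_k_books(scores: np.ndarray, metadata: list, k: int = 5) -> list:
    """Return the k highest-scoring books with their scores attached."""
    top = np.argsort(scores)[::-1][:k]
    return [{**metadata[i], "score": float(scores[i])} for i in top]
```

Normalizing both sides first lets a single matrix-vector product produce all cosine scores at once, which is the usual way to make this step fast over a whole corpus.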
…tions, and enhance logging
- Replaced hardcoded absolute file paths with relative paths using pathlib to improve portability.
- Refactored code into modular functions for text normalization, preprocessing, embedding-input creation, and embedding generation.
- Enhanced logging and error handling throughout the pipeline for better traceability and debugging.
- Updated docstrings for clarity and maintainability.
This refactor improves the overall code quality and maintainability of the embedding-generation pipeline.
… preprocessing to separate module
- Extracted `normalize_text`, `subjects_to_string`, `preprocess_book`, and `create_embedding_input` into a new module, `preprocessing.py`.
- Updated `generate_embeddings.py` to import preprocessing functions from `preprocessing.py`.
- Replaced hardcoded absolute paths with relative paths using pathlib for improved portability.
- Enhanced logging and error handling throughout the pipeline.
- Separated concerns into distinct functions for better readability and maintainability.
This refactor modularizes the embedding-generation process and improves overall code quality.
- Combined functionality from `preprocessing.py` and `preprocess.py` into a single module.
- Removed duplicate text-normalization routines by unifying them with spaCy-based processing.
- Renamed functions (e.g., `normalize_subjects`, `preprocess_book_record`, `generate_embedding_input`) for clearer intent and readability.
- Updated the file-processing pipeline to reuse shared normalization and preprocessing logic.
- Improved error handling and logging throughout the pipeline.
- Ensures that create_vector_embedding correctly generates embeddings while handling different text cases.
- Added more test data to the `test_calculate_similarity_scores` function.
- Fixed an assertion issue in the `test_get_top_k_books` function, which checked `numel()` instead of `len()`.
- Improved the overall structure and organization of the test cases.
- Loaded book embeddings and metadata into memory on startup for optimized performance.
- Implemented query-embedding generation using SentenceTransformer.
- Computed cosine similarity scores between user queries and stored book embeddings.
- Retrieved the top 5 recommended books based on similarity scores.
- Added Redis caching to prevent redundant searches and improve response times.
…stAPI lifespan
- Moved SentenceTransformer model initialization to the FastAPI lifespan event.
- Ensured embeddings and metadata are preloaded in `app.state` for efficient access.
- Added error handling for model, embeddings, and Redis failures to prevent crashes.
- Implemented cleanup of GPU memory and subprocesses on shutdown.
- Fixed "KeyError: 'model'" by verifying app state before accessing resources.
…tate
- Updated the `/search_books` route to retrieve the model and embeddings from `app.state` instead of loading them at module level.
- Added error handling to prevent crashes when the model or embeddings are unavailable.
- Implemented safeguards using `getattr()` to check for missing attributes before accessing them.
- Ensured Redis cache lookups fail gracefully instead of breaking the API.
- Improved logging for better observability of search queries and system state.
…ture
- Updated test assertions to match the new API response format (a list of books instead of a "recommendations" key).
- Ensured tests validate the required fields: title, author, year, book_id, and subjects.
- Fixed the caching test to confirm Redis returns identical results on repeated queries.
- Added explicit checks for a missing model and embeddings, ensuring proper 500 error responses.
- Integrated pytest-watch into the testing pipeline.
- Configured pytest-watch to automatically re-run tests when code changes are detected.
- Updated `conftest.py` to include the fixtures pytest-watch needs.
- Created a new directory, `books_metadata`, to store book metadata.
- Added a basic structure and organization for storing metadata files.
- Moved Redis `book_cache` initialization to the FastAPI lifespan event.
- Ensured `book_cache` is always set in `app.state` to prevent an AttributeError.
- Improved error handling for Redis connection failures, logging issues instead of crashing.
- Updated `/healthcheck/redis` to check that `book_cache` exists before accessing Redis.
- Fixed test cases to handle scenarios where Redis is unavailable.
- Ensured application startup properly initializes all required dependencies.
- Refactored `app/api/routes.py` to improve request handling and optimize the API structure.
- Updated `app/main.py` to ensure proper service initialization and dependency management.
- Removed deprecated services (`book_processor.py`, `generate_embedding.py`, `preprocess_books.py`) to consolidate preprocessing and embedding generation.
- Optimized `app/services/semantic_search.py` for efficient retrieval and similarity computation.
These files will be used to build a book corpus designed for information retrieval and semantic search in the pipeline.
- Added `books.json` to store structured book data categorized by subject.
- Added `book_metadata.json` containing preprocessed book records for indexing.
- Added `book_embeddings.json` to store vector embeddings for book metadata.
- Added `extract_subjects()` to scrape available book subjects from Open Library.
- Implemented `extract_books()` to retrieve book metadata categorized by subject.
- Introduced `fetch_book_metadata()` to fetch detailed metadata for a given book using its work ID.
- Saves extracted metadata in structured JSON format for indexing.
…reprocessing
- Added `normalize_text()` to clean and standardize text by lowercasing, removing special characters, and applying tokenization, lemmatization, and stopword removal.
- Implemented `normalize_subjects()` to ensure consistency in book-subject categorization.
- Introduced `preprocess_book_record()` to normalize book metadata, handling title, author, subjects, and publication year.
- Developed `format_book_for_embedding()` to structure metadata for vector-embedding generation.
- Integrated logging for improved traceability during preprocessing.
- Ensured preprocessed data is saved in `book_metadata.json` for downstream indexing and embedding.
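A rough sketch of the preprocessing step described above. The PR uses spaCy for tokenization, lemmatization, and stopword removal; this self-contained approximation uses a regex and a tiny stopword set instead, so the function names match the commit but the internals are simplified stand-ins.

```python
import re

# Tiny illustrative stopword set; the real pipeline uses spaCy's.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on"}

def normalize_text(text: str) -> str:
    """Lowercase, strip non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

def format_book_for_embedding(book: dict) -> str:
    """Concatenate normalized metadata fields into one embedding-input string."""
    parts = [
        normalize_text(book.get("title", "")),
        normalize_text(book.get("author", "")),
        normalize_text(" ".join(book.get("subjects", []))),
    ]
    return " ".join(p for p in parts if p)
```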
…k metadata and embeddings
- Added `app/pipelines/load.py` to manage data loading and saving for book metadata and embeddings.
- Implemented `load_json_file()` to read structured data from JSON files.
- Introduced `save_subjects_metadata()` to store extracted subject lists in JSON format.
- Developed `save_book_metadata()` to save structured book metadata for downstream processing.
- Implemented `save_book_embeddings()` to store vectorized book embeddings for retrieval.
- Added `load_book_embeddings()` to load stored embeddings as NumPy arrays.
- Created `load_book_metadata()` to read book metadata for semantic-search indexing.
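A minimal sketch of a few of these helpers, assuming the JSON layouts implied by the commits (a list of record dicts for metadata, a list of float lists for embeddings). The `DATA_DIR` path is a placeholder, not the project's actual layout.

```python
import json
from pathlib import Path

import numpy as np

DATA_DIR = Path("books_metadata")  # hypothetical data directory

def load_json_file(path: Path):
    """Read structured data from a JSON file."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)

def save_book_metadata(records: list, path: Path) -> None:
    """Write book records as pretty-printed JSON, creating parent dirs."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")

def load_book_embeddings(path: Path) -> np.ndarray:
    """Load stored embeddings as a 2-D float32 NumPy array."""
    return np.array(load_json_file(path), dtype=np.float32)
```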