feat(api): integrate semantic search into /search_books endpoint #20
Merged
Owner ddayto21 commented on Feb 10, 2025
- Created search index for book corpus designed for semantic search.
- Implemented book pipeline to extract metadata for book corpus.
- Added text normalization and book metadata preprocessing.
- Updated `_basic_nlp_cleanup()` to remove stopwords in addition to lowercasing and trimming input.
- Modified `_refine_query()` to detect and filter out extra debugging text (e.g. "Detected Place(s):", "Original Query:") from generated output, falling back to the cleaned original query plus any extracted keywords when the generated result is invalid.
- These changes ensure that the query "The Cat and The Hat" is correctly normalized to "cat hat", satisfying the expected output in the integration and unit tests, and improve the robustness of the query-processing pipeline.
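The cleanup behavior described above can be sketched roughly as follows. This is a minimal stand-in, not the PR's actual implementation: the function name mirrors `_basic_nlp_cleanup()`, but the stopword list here is a small illustrative set rather than whatever the real pipeline uses.

```python
import re

# Illustrative stopword list; the real pipeline likely uses a fuller set
# (e.g. from spaCy or NLTK).
STOPWORDS = {"the", "and", "a", "an", "of", "in", "on", "to"}

def basic_nlp_cleanup(query: str) -> str:
    """Lowercase, trim, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(basic_nlp_cleanup("The Cat and The Hat"))  # cat hat
```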
- Created a new `LLMWorker` class in `app/services/llm_worker.py` to handle CPU-bound tasks via `ThreadPoolExecutor`.
- Updated `LLMClient` (`app/clients/llm_client.py`) to use `LLMWorker` for pipeline inference calls, preventing the main event loop from blocking.
- Added unit tests in `app/tests/services/test_llm_worker.py` to verify concurrency, exception handling, and proper thread-pool shutdown.
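The offloading pattern this commit describes can be sketched as below. This is an assumption about the shape of `LLMWorker`, not the actual class from `app/services/llm_worker.py` (which this PR later removes): the core idea is `loop.run_in_executor` over a dedicated `ThreadPoolExecutor` so blocking pipeline calls don't stall the event loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class LLMWorker:
    """Runs CPU-bound callables in a thread pool so the asyncio
    event loop stays responsive while inference executes."""

    def __init__(self, max_workers: int = 2):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    async def run(self, fn, *args):
        # Hand the blocking call to the pool and await its result.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, fn, *args)

    def shutdown(self):
        # Wait for in-flight tasks before releasing the pool.
        self._executor.shutdown(wait=True)

async def main():
    worker = LLMWorker()
    try:
        # Stand-in for a CPU-bound pipeline call.
        result = await worker.run(sum, [1, 2, 3])
        print(result)  # 6
    finally:
        worker.shutdown()

asyncio.run(main())
```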
- Introduced a new test suite in `app/tests/clients/test_llm_client.py` that exercises the `LLMClient` with real pipeline calls (no mocks).
- Verifies end-to-end functionality for NLP cleanup, entity extraction, zero-shot classification, and text generation.
… keyword extraction using the Ollama API.
- Implemented `extract_keywords(query)`, an async function that sends user queries to the model.
- Configured a system prompt to ensure extracted keywords are returned in comma-separated format.
- Used `AsyncClient` from Ollama to manage API interactions.
This module will be used to extract relevant keywords from user queries efficiently.
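A sketch of how `extract_keywords(query)` might look, under the assumption that it uses the `ollama` Python package's `AsyncClient.chat` API against a locally served `llama3.2` model. The system prompt wording and the `parse_keywords` helper are illustrative, not taken from the PR; only the parsing step is testable without a running Ollama server.

```python
# Hypothetical system prompt; the PR's actual prompt text is not shown.
SYSTEM_PROMPT = (
    "Extract the key search terms from the user's query and return them "
    "as a single comma-separated line, with no other text."
)

def parse_keywords(raw: str) -> list[str]:
    """Split the model's comma-separated reply into clean keywords."""
    return [k.strip().lower() for k in raw.split(",") if k.strip()]

async def extract_keywords(query: str, model: str = "llama3.2") -> list[str]:
    # Imported lazily so the parsing helper is usable without the
    # ollama package or a running server.
    from ollama import AsyncClient

    response = await AsyncClient().chat(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    return parse_keywords(response["message"]["content"])
```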
- Simplified `LLMClient` by removing unused worker logic and added conditional resource cleanup for the httpx `AsyncClient`.
- Updated `test_llm_client.py` with enhanced fuzzy matching for keyword extraction, ensuring tests reflect real LLM responses.
- Deleted `app/services/llm_worker.py` and its tests, as they are no longer needed.
- Updated `app/api/routes.py` to use the simplified `LLMClient` directly, without offloading to a worker.
- Updated dependency versions and lockfile to reflect recent changes and improvements.
- Replaced per-request `AsyncClient` instantiation with a single, reusable client for connection pooling.
- Removed redundant argument handling in `fetch_data` to simplify the code path.
- Reduced overhead and improved response latency by streamlining error handling.
- Changed the fixture scope from module to function to avoid "Event loop is closed" errors.
- Each test now creates its own `OpenLibraryAPI` client instance, ensuring proper cleanup.
- Improved the reliability of asynchronous tests.
…tion tests
- Replaced deprecated `on_event` startup/shutdown decorators in `app/main.py` with a lifespan context manager.
- Updated integration tests in `app/tests/integration/api/test_main_route.py` to use `asgi_lifespan` for proper lifecycle handling.
- Ensured that `app.state` is correctly initialized during tests.
…ch tests
- Modified `app/api/routes.py` to work with the updated `OpenLibraryAPI` client.
- Added `app/clients/__init__.py` for package initialization.
- Updated `app/clients/open_library_api_client.py` to support subject-based queries via the search endpoint.
- Added `app/tests/clients/test_search_subjects.py` to validate the new subject-search functionality.
- Updated `pyproject.toml` and `poetry.lock` to include `asgi-lifespan` for managing FastAPI lifespan events.
- Modified `pytest.ini` to configure the async fixture loop scope for proper lifespan handling.
- Set `asyncio_default_fixture_loop_scope` to "function" in `pytest.ini` to address deprecation warnings.
- Verified that all unit and integration tests pass with `asgi-lifespan` and the updated event handling.
… build error
- Updated the Dockerfile to install `build-essential` and `gcc`.
- This enables compilation of C extensions (e.g. blis) on the slim Python image.
- Removed manual `on_startup`/`on_shutdown` calls in test fixtures.
- Added `LifespanManager` to handle FastAPI's lifespan context.
- Ensured each test uses an `ASGITransport`-based `AsyncClient` for in-process requests.
Added a new `ollama` service to `docker-compose.yml`. This service spins up a container that serves a Large Language Model (LLM) using the `ollama/ollama` image, configured to:
- Pull the latest `ollama/ollama` image.
- Download the `llama3.2` model and serve it on port `11435`.
- Use a named volume (`ollama`) to persist model files and avoid re-downloading.
- Restart automatically unless explicitly stopped.
The existing `api` service now depends on `ollama` for LLM functionality, enabling the application to use LLM capabilities within the Docker environment.
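A compose fragment matching that description might look like the following. This is a hedged reconstruction, not the PR's actual `docker-compose.yml`: the host port `11435` and model name come from the commit message, while the `11435:11434` mapping assumes Ollama's default in-container port of 11434, and the `api` build context is a placeholder.

```yaml
services:
  api:
    build: .            # placeholder build context
    depends_on:
      - ollama
  ollama:
    image: ollama/ollama
    ports:
      - "11435:11434"   # host 11435 -> Ollama's default port inside the container
    volumes:
      - ollama:/root/.ollama   # persist downloaded models across restarts
    restart: unless-stopped

volumes:
  ollama:
```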
…etween search query and book embeddings
Implemented a script that calculates cosine similarity scores between a given search query and a set of precomputed book embeddings. The script loads book metadata and embeddings from JSON files, processes the search query, and displays the top N related books by similarity score.
- Added `load_book_embeddings` and `load_books_metadata` functions to load book metadata and embeddings from JSON files.
- Implemented `calculate_similarity_scores` to compute cosine similarities between a search query and book embeddings.
- Added `get_top_k_books` to retrieve the top N related books based on similarity scores.
- Configured logging for detailed output and initialized the SentenceTransformer model globally.
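The scoring and ranking steps can be sketched with NumPy as below. The function names mirror `calculate_similarity_scores` and `get_top_k_books` from the commit, but the signatures are assumptions; the real script encodes the query with SentenceTransformer first, which is omitted here.

```python
import numpy as np

def calculate_similarity_scores(query_vec: np.ndarray,
                                book_embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each embedding row."""
    q = query_vec / np.linalg.norm(query_vec)
    b = book_embeddings / np.linalg.norm(book_embeddings, axis=1, keepdims=True)
    return b @ q

def get_top_k_books(scores: np.ndarray, metadata: list, k: int = 5) -> list:
    """Return the k highest-scoring books with their scores attached."""
    top = np.argsort(scores)[::-1][:k]
    return [{**metadata[i], "score": float(scores[i])} for i in top]
```

Normalizing both sides first lets a single matrix-vector product produce all cosine scores at once, which is the usual way to make this step fast over a whole corpus.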
…tions, and enhance logging
- Replaced hardcoded absolute file paths with relative paths using pathlib to improve portability.
- Refactored code into modular functions for text normalization, preprocessing, embedding-input creation, and embedding generation.
- Enhanced logging and error handling throughout the pipeline for better traceability and debugging.
- Updated docstrings for clarity and maintainability.
This refactor improves the overall code quality and maintainability of the embedding-generation pipeline.
… preprocessing to separate module
- Extracted `normalize_text`, `subjects_to_string`, `preprocess_book`, and `create_embedding_input` into a new module, `preprocessing.py`.
- Updated `generate_embeddings.py` to import preprocessing functions from `preprocessing.py`.
- Replaced hardcoded absolute paths with relative paths using pathlib for improved portability.
- Enhanced logging and error handling throughout the pipeline.
- Separated concerns into distinct functions for better readability and maintainability.
This refactor modularizes the embedding-generation process and improves overall code quality.
- Combined functionality from `preprocessing.py` and `preprocess.py` into a single module.
- Removed duplicate text-normalization routines by unifying them with spaCy-based processing.
- Renamed functions (e.g., `normalize_subjects`, `preprocess_book_record`, `generate_embedding_input`) for clearer intent and readability.
- Updated the file-processing pipeline to reuse shared normalization and preprocessing logic.
- Improved error handling and logging throughout the pipeline.
- Ensures that create_vector_embedding correctly generates embeddings while handling different text cases.
- Added more test data to the `test_calculate_similarity_scores` function.
- Fixed an assertion issue in the `test_get_top_k_books` function, which checked `numel()` instead of `len()`.
- Improved the overall structure and organization of the test cases.
- Loaded book embeddings and metadata into memory on startup for optimized performance.
- Implemented query-embedding generation using SentenceTransformer.
- Computed cosine similarity scores between user queries and stored book embeddings.
- Retrieved the top 5 recommended books based on similarity scores.
- Added Redis caching to prevent redundant searches and improve response times.
…stAPI lifespan
- Moved SentenceTransformer model initialization to the FastAPI lifespan event.
- Ensured embeddings and metadata are preloaded in `app.state` for efficient access.
- Added error handling for model, embeddings, and Redis failures to prevent crashes.
- Implemented cleanup of GPU memory and subprocesses on shutdown.
- Fixed "KeyError: 'model'" by verifying app state before accessing resources.
…tate
- Updated the `/search_books` route to retrieve the model and embeddings from `app.state` instead of loading them at module level.
- Added error handling to prevent crashes when the model or embeddings are unavailable.
- Implemented safeguards using `getattr()` to check for missing attributes before accessing them.
- Ensured Redis cache lookups fail gracefully instead of breaking the API.
- Improved logging for better observability of search queries and system state.
…ture
- Updated test assertions to match the new API response format (a list of books instead of a "recommendations" key).
- Ensured tests validate the required fields: title, author, year, book_id, and subjects.
- Fixed the caching test to confirm Redis returns identical results on repeated queries.
- Added explicit checks for a missing model and embeddings, ensuring proper 500 error responses.
- Integrated pytest-watch into the testing pipeline.
- Configured pytest-watch to automatically re-run tests when code changes are detected.
- Updated `conftest.py` to include the fixtures pytest-watch needs.
- Created a new directory, `books_metadata`, to store book metadata.
- Added a basic structure and organization for storing metadata files.
- Moved Redis `book_cache` initialization to the FastAPI lifespan event.
- Ensured `book_cache` is always set in `app.state` to prevent an AttributeError.
- Improved error handling for Redis connection failures, logging issues instead of crashing.
- Updated `/healthcheck/redis` to check that `book_cache` exists before accessing Redis.
- Fixed test cases to handle scenarios where Redis is unavailable.
- Ensured application startup properly initializes all required dependencies.
- Refactored `app/api/routes.py` to improve request handling and optimize the API structure.
- Updated `app/main.py` to ensure proper service initialization and dependency management.
- Removed deprecated services (`book_processor.py`, `generate_embedding.py`, `preprocess_books.py`) to consolidate preprocessing and embedding generation.
- Optimized `app/services/semantic_search.py` for efficient retrieval and similarity computation.
These files will be used to build a book corpus designed for information retrieval and semantic search in the pipeline.
- Added `books.json` to store structured book data categorized by subject.
- Added `book_metadata.json` containing preprocessed book records for indexing.
- Added `book_embeddings.json` to store vector embeddings for book metadata.
- Added `extract_subjects()` to scrape available book subjects from Open Library.
- Implemented `extract_books()` to retrieve book metadata categorized by subject.
- Introduced `fetch_book_metadata()` to fetch detailed metadata for a given book using its work ID.
- Saves extracted metadata in structured JSON format for indexing.
…reprocessing
- Added `normalize_text()` to clean and standardize text by lowercasing, removing special characters, and applying tokenization, lemmatization, and stopword removal.
- Implemented `normalize_subjects()` to ensure consistency in book-subject categorization.
- Introduced `preprocess_book_record()` to normalize book metadata, handling title, author, subjects, and publication year.
- Developed `format_book_for_embedding()` to structure metadata for vector-embedding generation.
- Integrated logging for improved traceability during preprocessing.
- Ensured preprocessed data is saved in `book_metadata.json` for downstream indexing and embedding.
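A rough sketch of the preprocessing step described above. The PR uses spaCy for tokenization, lemmatization, and stopword removal; this self-contained approximation uses a regex and a tiny stopword set instead, so the function names match the commit but the internals are simplified stand-ins.

```python
import re

# Tiny illustrative stopword set; the real pipeline uses spaCy's.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on"}

def normalize_text(text: str) -> str:
    """Lowercase, strip non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

def format_book_for_embedding(book: dict) -> str:
    """Concatenate normalized metadata fields into one embedding-input string."""
    parts = [
        normalize_text(book.get("title", "")),
        normalize_text(book.get("author", "")),
        normalize_text(" ".join(book.get("subjects", []))),
    ]
    return " ".join(p for p in parts if p)
```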
…k metadata and embeddings
- Added `app/pipelines/load.py` to manage data loading and saving for book metadata and embeddings.
- Implemented `load_json_file()` to read structured data from JSON files.
- Introduced `save_subjects_metadata()` to store extracted subject lists in JSON format.
- Developed `save_book_metadata()` to save structured book metadata for downstream processing.
- Implemented `save_book_embeddings()` to store vectorized book embeddings for retrieval.
- Added `load_book_embeddings()` to load stored embeddings as NumPy arrays.
- Created `load_book_metadata()` to read book metadata for semantic-search indexing.
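A minimal sketch of a few of these helpers, assuming the JSON layouts implied by the commits (a list of record dicts for metadata, a list of float lists for embeddings). The `DATA_DIR` path is a placeholder, not the project's actual layout.

```python
import json
from pathlib import Path

import numpy as np

DATA_DIR = Path("books_metadata")  # hypothetical data directory

def load_json_file(path: Path):
    """Read structured data from a JSON file."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)

def save_book_metadata(records: list, path: Path) -> None:
    """Write book records as pretty-printed JSON, creating parent dirs."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")

def load_book_embeddings(path: Path) -> np.ndarray:
    """Load stored embeddings as a 2-D float32 NumPy array."""
    return np.array(load_json_file(path), dtype=np.float32)
```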