diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
new file mode 100644
index 0000000..641c284
--- /dev/null
+++ b/.github/CONTRIBUTING.md
@@ -0,0 +1,99 @@
+# Contributing to Docugami
+
+Hi there! Thank you for even being interested in contributing to Docugami's dgml-utils.
+As an open-source project in a rapidly developing field, we are extremely open to contributions, whether they involve new features, improved infrastructure, better documentation, or bug fixes.
+
+## 🗺️ Guidelines
+
+### 👩‍💻 Contributing Code
+
+To contribute to this project, please follow the ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
+Please do not try to push directly to this repo unless you are a maintainer.
+
+Please follow the checked-in pull request template when opening pull requests. Note related issues and tag relevant
+maintainers.
+
+Pull requests cannot land without passing the formatting, linting, and testing checks first. See [Testing](#testing) and
+[Formatting and Linting](#formatting-and-linting) for how to run these checks locally.
+
+If there's something you'd like to add or change, opening a pull request is the
+best way to get our attention.
+
+### 🚩GitHub Issues
+
+Our [issues](https://github.com/docugami/dgml-utils/issues) page is kept up to date with bugs, improvements, and feature requests.
+
+If you start working on an issue, please assign it to yourself.
+
+If you are adding an issue, please try to keep it focused on a single, modular bug/improvement/feature.
+If two issues are related, or blocking, please link them rather than combining them.
+
+We will try to keep these issues as up-to-date as possible, though
+with the rapid rate of development in this field some may get out of date.
+If you notice this happening, please let us know.
+
+### 🙋Getting Help
+
+Our goal is to have the simplest developer setup possible. Should you experience any difficulty getting set up, please
+contact a maintainer! Not only do we want to help get you unblocked, but we also want to make sure that the process is
+smooth for future contributors.
+
+In a similar vein, we do enforce certain linting, formatting, and documentation standards in the codebase.
+If you are finding these difficult (or even just annoying) to work with, feel free to contact a maintainer for help -
+we do not want these to get in the way of getting good code into the codebase.
+
+### Local Development Dependencies
+
+Install dgml-utils development requirements (for running dgml-utils, running examples, linting, formatting, tests, and coverage):
+
+```bash
+poetry install
+```
+
+Then verify dependency installation:
+
+```bash
+make test
+```
+
+### Testing
+
+Unit tests cover modular logic that does not require calls to outside APIs.
+If you add new logic, please add a unit test.
+
+To run unit tests:
+
+```bash
+make test
+```
+
+### Formatting and Linting
+
+Run these locally before submitting a PR; the CI system will also check them.
+
+#### Code Formatting
+
+Formatting for this project is done via [ruff](https://docs.astral.sh/ruff/rules/).
+
+To run formatting for docs, cookbook and templates:
+
+```bash
+make format
+```
+
+#### Linting
+
+Linting for this project is done via a combination of [ruff](https://docs.astral.sh/ruff/rules/) and [mypy](http://mypy-lang.org/).
+
+To run linting for docs, cookbook and templates:
+
+```bash
+make lint
+```
+
+We recognize linting can be annoying - if you do not want to do it, please contact a project maintainer, and they can help you with it. We do not want this to be a blocker for good code getting contributed.
+
+## 🏭 Release Process
+
+As of now, Docugami has an ad hoc release process: releases are cut with high frequency by
+a developer and published to [PyPI](https://pypi.org/project/dgml-utils/).
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
new file mode 100644
index 0000000..2fcf1c6
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,32 @@
+---
+name: Bug report
+about: Create a report to help us improve
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior:
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Screenshots**
+If applicable, add screenshots to help explain your problem.
+
+**Desktop (please complete the following information):**
+ - OS: [e.g. iOS]
+ - Browser [e.g. chrome, safari]
+ - Version [e.g. 22]
+
+**Additional context**
+Add any other context about the problem here.
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
new file mode 100644
index 0000000..bbcbbe7
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,20 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
new file mode 100644
index 0000000..7b7555e
--- /dev/null
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,15 @@
+
\ No newline at end of file
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
new file mode 100644
index 0000000..d36b092
--- /dev/null
+++ b/.github/workflows/ci.yml
@@ -0,0 +1,31 @@
+name: CI
+
+on: [push]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Check out the code
+        uses: actions/checkout@v3
+
+      - name: Install Poetry
+        run: |
+          curl -sSL https://install.python-poetry.org | python3 -
+        shell: bash
+
+      - name: Install dependencies
+        working-directory: python
+        run: poetry install
+
+      - name: Lint code
+        working-directory: python
+        run: make lint
+
+      - name: Check PR status
+        run: |
+          if [ -n "$(git diff --name-only ${{ github.base_ref }}..${{ github.head_ref }})" ]; then
+            echo "Changes detected. Please make sure to push all changes to the branch before merging.";
+            exit 1;
+          fi
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..a950dd0
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,7 @@
+format:
+	poetry run black .
+
+lint:
+	poetry run ruff check .
+	poetry run black --check .
+	poetry run npx pyright .
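Note on the `lint` target above: `make` runs the three commands in order and stops at the first one that exits non-zero. For environments without `make`, the same behavior might be sketched in Python as follows. This is only an illustration under stated assumptions: `run_lint`, `LINT_COMMANDS`, and the `runner` parameter are hypothetical names that do not appear in this change, and the command list is copied from the Makefile hunk.

```python
import subprocess
from typing import Callable, List, Sequence

# The three commands from the Makefile's `lint` target, in the same order.
LINT_COMMANDS: List[List[str]] = [
    ["poetry", "run", "ruff", "check", "."],
    ["poetry", "run", "black", "--check", "."],
    ["poetry", "run", "npx", "pyright", "."],
]


def run_lint(
    commands: Sequence[Sequence[str]] = LINT_COMMANDS,
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run each command in order and return the first non-zero exit code.

    Hypothetical helper: `runner` defaults to subprocess.call, which returns
    the exit code of the child process, and can be swapped out in tests.
    """
    for command in commands:
        exit_code = runner(command)
        if exit_code != 0:
            # Stop at the first failure, as make does by default.
            return exit_code
    return 0
```

Calling `run_lint()` from the directory that holds `pyproject.toml` would then approximate `make lint`; passing a stub `runner` allows the control flow to be exercised without Poetry installed.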
diff --git a/app/server.py b/app/server.py index 0e62e8d..b8c5d9a 100644 --- a/app/server.py +++ b/app/server.py @@ -1,9 +1,6 @@ -import os -import sys from fastapi import FastAPI from langserve import add_routes from docugami_kg_rag.chain import chain as docugami_kg_rag_chain -import subprocess app = FastAPI() @@ -11,4 +8,5 @@ if __name__ == "__main__": import uvicorn + uvicorn.run(app, host="0.0.0.0", port=8000) diff --git a/packages/docugami-kg-rag/docugami_kg_rag/chain.py b/packages/docugami-kg-rag/docugami_kg_rag/chain.py index 2f0bafa..25a3144 100644 --- a/packages/docugami-kg-rag/docugami_kg_rag/chain.py +++ b/packages/docugami-kg-rag/docugami_kg_rag/chain.py @@ -65,16 +65,12 @@ def _format_chat_history(chat_history: List[Tuple[str, str]]): { "input": lambda x: x["input"], # type: ignore "chat_history": lambda x: _format_chat_history(x["chat_history"]), # type: ignore - "agent_scratchpad": lambda x: format_to_openai_functions( - x["intermediate_steps"] - ), # type: ignore + "agent_scratchpad": lambda x: format_to_openai_functions(x["intermediate_steps"]), # type: ignore "functions": lambda x: [ format_tool_to_openai_function(tool) for tool in ( - docset_retrieval_tools + report_retrieval_tools - if x["use_reports"] - else docset_retrieval_tools - ) # type: ignore + docset_retrieval_tools + report_retrieval_tools if x["use_reports"] else docset_retrieval_tools # type: ignore + ) ], } ) diff --git a/packages/docugami-kg-rag/docugami_kg_rag/config.py b/packages/docugami-kg-rag/docugami_kg_rag/config.py index 4712bd2..c4050fd 100644 --- a/packages/docugami-kg-rag/docugami_kg_rag/config.py +++ b/packages/docugami-kg-rag/docugami_kg_rag/config.py @@ -20,14 +20,10 @@ CHROMA_DIRECTORY = "/tmp/chroma_db" os.makedirs(Path(CHROMA_DIRECTORY).parent, exist_ok=True) -INDEXING_LOCAL_STATE_PATH = os.environ.get( - "INDEXING_LOCAL_STATE_PATH", "/tmp/indexing_local_state.pkl" -) +INDEXING_LOCAL_STATE_PATH = os.environ.get("INDEXING_LOCAL_STATE_PATH", 
"/tmp/indexing_local_state.pkl") os.makedirs(Path(INDEXING_LOCAL_STATE_PATH).parent, exist_ok=True) -INDEXING_LOCAL_REPORT_DBS_ROOT = os.environ.get( - "INDEXING_LOCAL_REPORT_DBS_ROOT", "/tmp/report_dbs" -) +INDEXING_LOCAL_REPORT_DBS_ROOT = os.environ.get("INDEXING_LOCAL_REPORT_DBS_ROOT", "/tmp/report_dbs") os.makedirs(Path(INDEXING_LOCAL_REPORT_DBS_ROOT).parent, exist_ok=True) LOCAL_LLM_CACHE_DB_FILE = os.environ.get("LOCAL_LLM_CACHE", "/tmp/.langchain.db") diff --git a/packages/docugami-kg-rag/docugami_kg_rag/helpers/documents.py b/packages/docugami-kg-rag/docugami_kg_rag/helpers/documents.py index 27e5d41..f821528 100644 --- a/packages/docugami-kg-rag/docugami_kg_rag/helpers/documents.py +++ b/packages/docugami-kg-rag/docugami_kg_rag/helpers/documents.py @@ -21,11 +21,7 @@ def build_summary_mappings(docs_by_id: Dict[str, Document]) -> Dict[str, str]: # build summaries for all the given documents summaries: Dict[str, str] = {} - format = ( - "text" - if not INCLUDE_XML_TAGS - else "semantic XML without any namespaces or attributes" - ) + format = "text" if not INCLUDE_XML_TAGS else "semantic XML without any namespaces or attributes" # Splitting the documents into batches doc_items = list(docs_by_id.items()) diff --git a/packages/docugami-kg-rag/docugami_kg_rag/helpers/fused_summary_retriever.py b/packages/docugami-kg-rag/docugami_kg_rag/helpers/fused_summary_retriever.py index f1a73c2..2c4f9ba 100644 --- a/packages/docugami-kg-rag/docugami_kg_rag/helpers/fused_summary_retriever.py +++ b/packages/docugami-kg-rag/docugami_kg_rag/helpers/fused_summary_retriever.py @@ -60,9 +60,7 @@ class FusedSummaryRetriever(BaseRetriever): search_type: SearchType = SearchType.similarity """Type of search to perform (similarity / mmr)""" - def _get_relevant_documents( - self, query: str, *, run_manager: CallbackManagerForRetrieverRun - ) -> List[Document]: + def _get_relevant_documents(self, query: str, *, run_manager: CallbackManagerForRetrieverRun) -> List[Document]: """Get 
documents relevant to a query. Args: query: String to find relevant documents for @@ -71,9 +69,7 @@ def _get_relevant_documents( List of relevant documents """ if self.search_type == SearchType.mmr: - sub_docs = self.vectorstore.max_marginal_relevance_search( - query, **self.search_kwargs - ) + sub_docs = self.vectorstore.max_marginal_relevance_search(query, **self.search_kwargs) else: sub_docs = self.vectorstore.similarity_search(query, **self.search_kwargs) @@ -107,9 +103,7 @@ def _get_relevant_documents( fused_docs: List[Document] = [] for element in sorted(fused_doc_elements.values(), key=lambda x: x.rank): - fragments_str = "\n\n".join( - [d.page_content.strip() for d in element.fragments] - ) + fragments_str = "\n\n".join([d.page_content.strip() for d in element.fragments]) fused_docs.append( Document( page_content=DOCUMENT_SUMMARY_TEMPLATE.format( diff --git a/packages/docugami-kg-rag/docugami_kg_rag/helpers/indexing.py b/packages/docugami-kg-rag/docugami_kg_rag/helpers/indexing.py index e5fe511..41a1adf 100644 --- a/packages/docugami-kg-rag/docugami_kg_rag/helpers/indexing.py +++ b/packages/docugami-kg-rag/docugami_kg_rag/helpers/indexing.py @@ -48,9 +48,7 @@ def update_local_index(docset_id: str, name: str, parents_by_id: Dict[str, Docum doc_summaries_by_id_store.mset(list(doc_summaries.items())) direct_tool_function_name = docset_name_to_direct_retriever_tool_function_name(name) - direct_tool_description = chunks_to_direct_retriever_tool_description( - name, list(parents_by_id.values()) - ) + direct_tool_description = chunks_to_direct_retriever_tool_description(name, list(parents_by_id.values())) report_details = build_report_details(docset_id) doc_index_state = LocalIndexState( @@ -74,9 +72,7 @@ def populate_chroma_index(docset_id: str, chunks: List[Document]): print(f"Creating index for {docset_id}...") # Reset the collection - chroma = Chroma.from_documents( - chunks, EMBEDDINGS, persist_directory=CHROMA_DIRECTORY - ) + chroma = 
Chroma.from_documents(chunks, EMBEDDINGS, persist_directory=CHROMA_DIRECTORY) chroma.persist() print(f"Done embedding documents to chroma collection {docset_id}!") @@ -105,7 +101,7 @@ def index_docset(docset_id: str, name: str): parents_by_id: Dict[str, Document] = {} children_by_id: Dict[str, Document] = {} for chunk in chunks: - chunk_id = chunk.metadata.get("id") + chunk_id = str(chunk.metadata.get("id")) parent_chunk_id = chunk.metadata.get(loader.parent_id_key) if not parent_chunk_id: # parent chunk diff --git a/packages/docugami-kg-rag/docugami_kg_rag/helpers/reports.py b/packages/docugami-kg-rag/docugami_kg_rag/helpers/reports.py index a322fa4..2bb00f5 100644 --- a/packages/docugami-kg-rag/docugami_kg_rag/helpers/reports.py +++ b/packages/docugami-kg-rag/docugami_kg_rag/helpers/reports.py @@ -34,11 +34,7 @@ def download_project_latest_xlsx(project_url: str, local_xlsx: Path) -> Optional if response.ok: response_json = response.json()["artifacts"] xlsx_artifact = next( - ( - item - for item in response_json - if str(item["name"]).lower().endswith(".xlsx") - ), + (item for item in response_json if str(item["name"]).lower().endswith(".xlsx")), None, ) if xlsx_artifact: @@ -102,9 +98,7 @@ def report_details_to_report_query_tool_description(name: str, table_info: str) return description[:2048] # cap to avoid failures when the description is too long -def excel_to_sqlite_connection( - file_path: Union[Path, str], table_name: str -) -> sqlite3.Connection: +def excel_to_sqlite_connection(file_path: Union[Path, str], table_name: str) -> sqlite3.Connection: # Create a temporary SQLite database in memory conn = sqlite3.connect(":memory:") @@ -155,12 +149,8 @@ def build_report_details(docset_id: str) -> List[ReportDetails]: id=project.id, name=report_name, local_xlsx_path=local_xlsx_path, - retrieval_tool_function_name=report_name_to_report_query_tool_function_name( - project.name - ), - retrieval_tool_description=report_details_to_report_query_tool_description( - 
project.name, table_info - ), + retrieval_tool_function_name=report_name_to_report_query_tool_function_name(project.name), + retrieval_tool_description=report_details_to_report_query_tool_description(project.name, table_info), ) ) @@ -171,14 +161,10 @@ def get_retrieval_tool_for_report(report_details: ReportDetails) -> Optional[Bas if not report_details.local_xlsx_path: return None - conn = excel_to_sqlite_connection( - report_details.local_xlsx_path, report_details.name - ) + conn = excel_to_sqlite_connection(report_details.local_xlsx_path, report_details.name) db = connect_to_db(conn) toolkit = SQLDatabaseToolkit(db=db, llm=LLM) - agent = create_sql_agent( - llm=LLM, toolkit=toolkit, agent_type=AgentType.OPENAI_FUNCTIONS - ) + agent = create_sql_agent(llm=LLM, toolkit=toolkit, agent_type=AgentType.OPENAI_FUNCTIONS) return Tool.from_function( func=agent.run, diff --git a/packages/docugami-kg-rag/docugami_kg_rag/helpers/retrieval.py b/packages/docugami-kg-rag/docugami_kg_rag/helpers/retrieval.py index 9b5fb0c..9db3e20 100644 --- a/packages/docugami-kg-rag/docugami_kg_rag/helpers/retrieval.py +++ b/packages/docugami-kg-rag/docugami_kg_rag/helpers/retrieval.py @@ -73,14 +73,10 @@ def chunks_to_direct_retriever_tool_description(name: str, chunks: List[Document return f"Searches for and returns chunks from {name} documents. 
{summary}" -def get_retrieval_tool_for_docset( - docset_id: str, docset_state: LocalIndexState -) -> Optional[BaseTool]: +def get_retrieval_tool_for_docset(docset_id: str, docset_state: LocalIndexState) -> Optional[BaseTool]: # Chunks are in the vector store, and full documents are in the store inside the local state - chunk_vectorstore = Chroma( - persist_directory=CHROMA_DIRECTORY, embedding_function=EMBEDDINGS - ) + chunk_vectorstore = Chroma(persist_directory=CHROMA_DIRECTORY, embedding_function=EMBEDDINGS) retriever = FusedSummaryRetriever( vectorstore=chunk_vectorstore, diff --git a/packages/docugami-kg-rag/index.py b/packages/docugami-kg-rag/index.py index bb240f5..8d9c540 100644 --- a/packages/docugami-kg-rag/index.py +++ b/packages/docugami-kg-rag/index.py @@ -10,9 +10,7 @@ def main(): docsets_response = docugami_client.docsets.list() if not docsets_response or not docsets_response.docsets: - raise Exception( - "The workspace corresponding to the provided DOCUGAMI_API_KEY does not have any docsets." 
- ) + raise Exception("The workspace corresponding to the provided DOCUGAMI_API_KEY does not have any docsets.") docsets = docsets_response.docsets @@ -27,9 +25,7 @@ def main(): selected_docsets = [d for d in docsets] else: selected_indices = [int(i.strip()) for i in user_input.split(",")] - selected_docsets = [ - docsets[idx - 1] for idx in selected_indices if 0 < idx <= len(docsets) - ] + selected_docsets = [docsets[idx - 1] for idx in selected_indices if 0 < idx <= len(docsets)] for docset in [d for d in selected_docsets if d is not None]: if not docset.id or not docset.name: diff --git a/packages/docugami-kg-rag/pyproject.toml b/packages/docugami-kg-rag/pyproject.toml index 1f16f97..b4cf845 100644 --- a/packages/docugami-kg-rag/pyproject.toml +++ b/packages/docugami-kg-rag/pyproject.toml @@ -51,7 +51,6 @@ addopts = "--doctest-modules" norecursedirs = ".venv" [tool.pyright] -include = ["dgml_utils", "tests"] ignore = ["**/node_modules", "**/__pycache__", ".venv"] reportMissingImports = true reportMissingTypeStubs = false diff --git a/poetry.lock b/poetry.lock index 5605534..cba16ca 100644 --- a/poetry.lock +++ b/poetry.lock @@ -4614,4 +4614,4 @@ testing = ["big-O", "jaraco.functools", "jaraco.itertools", "more-itertools", "p [metadata] lock-version = "2.0" python-versions = ">=3.9,<4.0" -content-hash = "554695ccd980c257b0878111eaf70a1e984b37a05c8946a23627df9af64f0f8a" +content-hash = "13865975d4ceda749df2e15b5763c50f2f1d49a175b98a509ebe74b4665c4c4b" diff --git a/pyproject.toml b/pyproject.toml index 921f467..c82e8bb 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -20,6 +20,7 @@ docugami-kg-rag = {path = "packages/docugami-kg-rag", develop = true} jinja2 = "^3.1.2" typer = "^0.9.0" docugami = "^0.0.4" +black = "^23.11.0" [tool.poetry.group.dev.dependencies] @@ -48,7 +49,6 @@ addopts = "--doctest-modules" norecursedirs = ".venv" [tool.pyright] -include = ["dgml_utils", "tests"] ignore = ["**/node_modules", "**/__pycache__", ".venv"] reportMissingImports = true 
reportMissingTypeStubs = false
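
Note on the `index.py` hunk above: the reformatted comprehension parses a comma-separated list of 1-based indices and silently skips any index outside the valid range. A minimal sketch of that behavior, assuming a plain list in place of the Docugami docsets; `select_by_indices` is a hypothetical helper name used only for illustration and is not part of this change.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_by_indices(items: Sequence[T], user_input: str) -> List[T]:
    """Return the items named by a comma-separated list of 1-based indices."""
    # Same two steps as the index.py hunk: int() raises ValueError on
    # non-numeric input, and indices outside 1..len(items) are skipped.
    selected_indices = [int(i.strip()) for i in user_input.split(",")]
    return [items[idx - 1] for idx in selected_indices if 0 < idx <= len(items)]


if __name__ == "__main__":
    docsets = ["contracts", "leases", "invoices"]  # placeholder names
    print(select_by_indices(docsets, "1, 3"))
    print(select_by_indices(docsets, "0, 2, 7"))
```

The `0 < idx` bound matters because a plain `docsets[idx - 1]` with `idx == 0` would wrap around to the last element instead of being rejected.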