cleanup and add CI workflow
Kenzie Mihardja committed Dec 4, 2023
1 parent ac4c637 commit 58e29d9
Showing 18 changed files with 229 additions and 72 deletions.
99 changes: 99 additions & 0 deletions .github/CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
# Contributing to Docugami

Hi there! Thank you for even being interested in contributing to Docugami's dgml-utils.
As an open-source project in a rapidly developing field, we are extremely open to contributions, whether they involve new features, improved infrastructure, better documentation, or bug fixes.

## 🗺️ Guidelines

### 👩‍💻 Contributing Code

To contribute to this project, please follow the ["fork and pull request"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) workflow.
Please do not try to push directly to this repo unless you are a maintainer.

Please follow the checked-in pull request template when opening pull requests. Note related issues and tag relevant
maintainers.

Pull requests cannot land without passing the formatting, linting, and testing checks first. See [Testing](#testing) and
[Formatting and Linting](#formatting-and-linting) for how to run these checks locally.

If there's something you'd like to add or change, opening a pull request is the
best way to get our attention.

### 🚩 GitHub Issues

Our [issues](https://github.com/docugami/dgml-utils/issues) page is kept up to date with bugs, improvements, and feature requests.

If you start working on an issue, please assign it to yourself.

If you are adding an issue, please try to keep it focused on a single, modular bug/improvement/feature.
If two issues are related, or blocking, please link them rather than combining them.

We will try to keep these issues as up-to-date as possible, though
with the rapid rate of development in this field some may get out of date.
If you notice this happening, please let us know.

### 🙋 Getting Help

Our goal is to have the simplest developer setup possible. Should you experience any difficulty getting set up, please
contact a maintainer! Not only do we want to help get you unblocked, but we also want to make sure that the process is
smooth for future contributors.

In a similar vein, we do enforce certain linting, formatting, and documentation standards in the codebase.
If you are finding these difficult (or even just annoying) to work with, feel free to contact a maintainer for help -
we do not want these to get in the way of getting good code into the codebase.

### Local Development Dependencies

Install dgml-utils development requirements (for running dgml-utils, running examples, linting, formatting, tests, and coverage):

```bash
poetry install
```

Then verify dependency installation:

```bash
make test
```

### Testing

Unit tests cover modular logic that does not require calls to outside APIs.
If you add new logic, please add a unit test.

To run unit tests:

```bash
make test
```
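
As a rough illustration of the kind of test meant here, the sketch below checks a small pure function without any network or API access. The helper `normalize_whitespace` is hypothetical and not part of dgml-utils, and the example assumes the test runner picks up pytest-style `test_*` functions, which this guide does not state.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def test_normalize_whitespace() -> None:
    # Pure logic only: no network calls, no external services.
    assert normalize_whitespace("  a \n\n b\t c ") == "a b c"
    assert normalize_whitespace("") == ""
```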

### Formatting and Linting

Run these locally before submitting a PR; the CI system will also check them.

#### Code Formatting

Formatting for this project is done via [ruff](https://docs.astral.sh/ruff/rules/).

To run formatting for docs, cookbook and templates:

```bash
make format
```

#### Linting

Linting for this project is done via a combination of [ruff](https://docs.astral.sh/ruff/rules/) and [mypy](http://mypy-lang.org/).

To run linting for docs, cookbook and templates:

```bash
make lint
```

We recognize linting can be annoying - if you do not want to do it, please contact a project maintainer, and they can help you with it. We do not want this to be a blocker for good code getting contributed.

## 🏭 Release Process

As of now, Docugami has an ad hoc release process: releases are cut with high frequency by
a developer and published to [PyPI](https://pypi.org/project/dgml-utils/).
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,32 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

**Additional context**
Add any other context about the problem here.
20 changes: 20 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,20 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
15 changes: 15 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,15 @@
<!-- Thank you for contributing to Docugami's dgml-utils!
Replace this entire comment with:
- **Description:** a description of the change,
- **Issue:** the issue # it fixes (if applicable),
- **Dependencies:** any dependencies required for this change,
- **Tag maintainer:** for a quicker response, tag the relevant maintainer (see below),
Please make sure your PR is passing linting before submitting. Run `make lint` to check this locally.
See contribution guidelines for more information on how to write/run tests, lint, etc:
https://github.com/docugami/dgml-utils/tree/main/.github/CONTRIBUTING.md
If no one reviews your PR within a few days, please @-mention one of @tjaffri, @kenzie28.
-->
31 changes: 31 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,31 @@
name: CI

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Check out the code
        uses: actions/checkout@v3

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
        shell: bash

      - name: Install dependencies
        working-directory: python
        run: poetry install

      - name: Lint code
        working-directory: python
        run: make lint

      - name: Check PR status
        run: |
          if [ -n "$(git diff --name-only ${{ github.base_ref }}..${{ github.head_ref }})" ]; then
            echo "Changes detected. Please make sure to push all changes to the branch before merging.";
            exit 1;
          fi
7 changes: 7 additions & 0 deletions Makefile
@@ -0,0 +1,7 @@
format:
	poetry run black .

lint:
	poetry run ruff check .
	poetry run black --check .
	poetry run npx pyright .
4 changes: 1 addition & 3 deletions app/server.py
@@ -1,14 +1,12 @@
import os
import sys
from fastapi import FastAPI
from langserve import add_routes
from docugami_kg_rag.chain import chain as docugami_kg_rag_chain
import subprocess

app = FastAPI()

add_routes(app, docugami_kg_rag_chain, path="/docugami-kg-rag")

if __name__ == "__main__":
import uvicorn

uvicorn.run(app, host="0.0.0.0", port=8000)
10 changes: 3 additions & 7 deletions packages/docugami-kg-rag/docugami_kg_rag/chain.py
@@ -65,16 +65,12 @@ def _format_chat_history(chat_history: List[Tuple[str, str]]):
{
"input": lambda x: x["input"], # type: ignore
"chat_history": lambda x: _format_chat_history(x["chat_history"]), # type: ignore
"agent_scratchpad": lambda x: format_to_openai_functions(
x["intermediate_steps"]
), # type: ignore
"agent_scratchpad": lambda x: format_to_openai_functions(x["intermediate_steps"]), # type: ignore
"functions": lambda x: [
format_tool_to_openai_function(tool)
for tool in (
docset_retrieval_tools + report_retrieval_tools
if x["use_reports"]
else docset_retrieval_tools
) # type: ignore
docset_retrieval_tools + report_retrieval_tools if x["use_reports"] else docset_retrieval_tools # type: ignore
)
],
}
)
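
Collapsing the tool selection onto one line relies on Python's conditional expression binding more loosely than `+`, so the whole concatenation is the "true" branch. A small sketch with placeholder lists (the string values are stand-ins, not the real tool objects, and `select_tools` is a hypothetical name):

```python
# Placeholder values standing in for the real tool lists (assumption).
docset_retrieval_tools = ["docset_a", "docset_b"]
report_retrieval_tools = ["report_a"]


def select_tools(use_reports: bool) -> list:
    # Parsed as: (docset + report) if use_reports else docset
    return docset_retrieval_tools + report_retrieval_tools if use_reports else docset_retrieval_tools


with_reports = select_tools(True)
without_reports = select_tools(False)
```

With `use_reports` false, only the docset tools are offered; the report tools are never dropped from a combined list, they are simply not added.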
8 changes: 2 additions & 6 deletions packages/docugami-kg-rag/docugami_kg_rag/config.py
@@ -20,14 +20,10 @@
CHROMA_DIRECTORY = "/tmp/chroma_db"
os.makedirs(Path(CHROMA_DIRECTORY).parent, exist_ok=True)

INDEXING_LOCAL_STATE_PATH = os.environ.get(
"INDEXING_LOCAL_STATE_PATH", "/tmp/indexing_local_state.pkl"
)
INDEXING_LOCAL_STATE_PATH = os.environ.get("INDEXING_LOCAL_STATE_PATH", "/tmp/indexing_local_state.pkl")
os.makedirs(Path(INDEXING_LOCAL_STATE_PATH).parent, exist_ok=True)

INDEXING_LOCAL_REPORT_DBS_ROOT = os.environ.get(
"INDEXING_LOCAL_REPORT_DBS_ROOT", "/tmp/report_dbs"
)
INDEXING_LOCAL_REPORT_DBS_ROOT = os.environ.get("INDEXING_LOCAL_REPORT_DBS_ROOT", "/tmp/report_dbs")
os.makedirs(Path(INDEXING_LOCAL_REPORT_DBS_ROOT).parent, exist_ok=True)

LOCAL_LLM_CACHE_DB_FILE = os.environ.get("LOCAL_LLM_CACHE", "/tmp/.langchain.db")
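
The settings in this hunk share one pattern: read a path from the environment with a fallback default, then make sure its parent directory exists. A minimal sketch of that pattern, using a hypothetical variable name (`EXAMPLE_STATE_PATH`) and a temp-directory default rather than the project's real settings:

```python
import os
import tempfile
from pathlib import Path

# Hypothetical variable name and default location, for illustration only.
default_path = os.path.join(tempfile.gettempdir(), "example_state", "state.pkl")
state_path = os.environ.get("EXAMPLE_STATE_PATH", default_path)

# Only the parent directory is created; the file itself is not touched.
os.makedirs(Path(state_path).parent, exist_ok=True)

parent_exists = Path(state_path).parent.is_dir()
```

Note that `Path(...).parent` of a directory path is the directory above it, so applying the same call to a directory setting creates the containing directory, not the directory itself.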
@@ -21,11 +21,7 @@ def build_summary_mappings(docs_by_id: Dict[str, Document]) -> Dict[str, str]:
# build summaries for all the given documents

summaries: Dict[str, str] = {}
format = (
"text"
if not INCLUDE_XML_TAGS
else "semantic XML without any namespaces or attributes"
)
format = "text" if not INCLUDE_XML_TAGS else "semantic XML without any namespaces or attributes"

# Splitting the documents into batches
doc_items = list(docs_by_id.items())
@@ -60,9 +60,7 @@ class FusedSummaryRetriever(BaseRetriever):
search_type: SearchType = SearchType.similarity
"""Type of search to perform (similarity / mmr)"""

def _get_relevant_documents(
self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
def _get_relevant_documents(self, query: str, *, run_manager: CallbackManagerForRetrieverRun) -> List[Document]:
"""Get documents relevant to a query.
Args:
query: String to find relevant documents for
@@ -71,9 +69,7 @@ def _get_relevant_documents(
List of relevant documents
"""
if self.search_type == SearchType.mmr:
sub_docs = self.vectorstore.max_marginal_relevance_search(
query, **self.search_kwargs
)
sub_docs = self.vectorstore.max_marginal_relevance_search(query, **self.search_kwargs)
else:
sub_docs = self.vectorstore.similarity_search(query, **self.search_kwargs)

@@ -107,9 +103,7 @@ def _get_relevant_documents(

fused_docs: List[Document] = []
for element in sorted(fused_doc_elements.values(), key=lambda x: x.rank):
fragments_str = "\n\n".join(
[d.page_content.strip() for d in element.fragments]
)
fragments_str = "\n\n".join([d.page_content.strip() for d in element.fragments])
fused_docs.append(
Document(
page_content=DOCUMENT_SUMMARY_TEMPLATE.format(
10 changes: 3 additions & 7 deletions packages/docugami-kg-rag/docugami_kg_rag/helpers/indexing.py
@@ -48,9 +48,7 @@ def update_local_index(docset_id: str, name: str, parents_by_id: Dict[str, Docum
doc_summaries_by_id_store.mset(list(doc_summaries.items()))

direct_tool_function_name = docset_name_to_direct_retriever_tool_function_name(name)
direct_tool_description = chunks_to_direct_retriever_tool_description(
name, list(parents_by_id.values())
)
direct_tool_description = chunks_to_direct_retriever_tool_description(name, list(parents_by_id.values()))
report_details = build_report_details(docset_id)

doc_index_state = LocalIndexState(
@@ -74,9 +72,7 @@ def populate_chroma_index(docset_id: str, chunks: List[Document]):
print(f"Creating index for {docset_id}...")

# Reset the collection
chroma = Chroma.from_documents(
chunks, EMBEDDINGS, persist_directory=CHROMA_DIRECTORY
)
chroma = Chroma.from_documents(chunks, EMBEDDINGS, persist_directory=CHROMA_DIRECTORY)
chroma.persist()

print(f"Done embedding documents to chroma collection {docset_id}!")
@@ -105,7 +101,7 @@ def index_docset(docset_id: str, name: str):
parents_by_id: Dict[str, Document] = {}
children_by_id: Dict[str, Document] = {}
for chunk in chunks:
chunk_id = chunk.metadata.get("id")
chunk_id = str(chunk.metadata.get("id"))
parent_chunk_id = chunk.metadata.get(loader.parent_id_key)
if not parent_chunk_id:
# parent chunk
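
The `str(...)` wrapper gives `chunk_id` a definite `str` type, presumably to satisfy the type checker for the `Dict[str, Document]` keys. One side effect worth knowing: a missing `"id"` key no longer yields `None` but the literal string `"None"`. A sketch with plain dicts standing in for `chunk.metadata` (an assumption for illustration):

```python
# Plain dicts stand in for chunk.metadata (assumption for illustration).
with_id = {"id": 42}
without_id = {}

id_present = str(with_id.get("id"))
id_missing = str(without_id.get("id"))  # the string "None", which is truthy

# A guard on the raw value (a sketch, not the project's code) keeps missing ids detectable.
raw = without_id.get("id")
safe_id = str(raw) if raw is not None else None
```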
26 changes: 6 additions & 20 deletions packages/docugami-kg-rag/docugami_kg_rag/helpers/reports.py
@@ -34,11 +34,7 @@ def download_project_latest_xlsx(project_url: str, local_xlsx: Path) -> Optional
if response.ok:
response_json = response.json()["artifacts"]
xlsx_artifact = next(
(
item
for item in response_json
if str(item["name"]).lower().endswith(".xlsx")
),
(item for item in response_json if str(item["name"]).lower().endswith(".xlsx")),
None,
)
if xlsx_artifact:
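
The lookup above uses `next()` over a generator with a default, which returns the first matching item or `None` when nothing matches. A self-contained sketch with made-up artifact records (the record contents and the `first_xlsx` name are assumptions, not data from the real API):

```python
from typing import Optional

# Made-up artifact records, each with a "name" field as in the hunk above.
artifacts = [
    {"name": "summary.pdf"},
    {"name": "Report.XLSX"},
    {"name": "other.xlsx"},
]


def first_xlsx(items: list) -> Optional[dict]:
    # next() returns the first match, or the default (None) if the generator is empty.
    return next((item for item in items if str(item["name"]).lower().endswith(".xlsx")), None)


found = first_xlsx(artifacts)
missing = first_xlsx([{"name": "notes.txt"}])
```

Lower-casing the name before the suffix check makes the match case-insensitive, and the `None` default avoids a `StopIteration` when no spreadsheet is present.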
@@ -102,9 +98,7 @@ def report_details_to_report_query_tool_description(name: str, table_info: str)
return description[:2048] # cap to avoid failures when the description is too long


def excel_to_sqlite_connection(
file_path: Union[Path, str], table_name: str
) -> sqlite3.Connection:
def excel_to_sqlite_connection(file_path: Union[Path, str], table_name: str) -> sqlite3.Connection:
# Create a temporary SQLite database in memory
conn = sqlite3.connect(":memory:")

@@ -155,12 +149,8 @@ def build_report_details(docset_id: str) -> List[ReportDetails]:
id=project.id,
name=report_name,
local_xlsx_path=local_xlsx_path,
retrieval_tool_function_name=report_name_to_report_query_tool_function_name(
project.name
),
retrieval_tool_description=report_details_to_report_query_tool_description(
project.name, table_info
),
retrieval_tool_function_name=report_name_to_report_query_tool_function_name(project.name),
retrieval_tool_description=report_details_to_report_query_tool_description(project.name, table_info),
)
)

@@ -171,14 +161,10 @@ def get_retrieval_tool_for_report(report_details: ReportDetails) -> Optional[Bas
if not report_details.local_xlsx_path:
return None

conn = excel_to_sqlite_connection(
report_details.local_xlsx_path, report_details.name
)
conn = excel_to_sqlite_connection(report_details.local_xlsx_path, report_details.name)
db = connect_to_db(conn)
toolkit = SQLDatabaseToolkit(db=db, llm=LLM)
agent = create_sql_agent(
llm=LLM, toolkit=toolkit, agent_type=AgentType.OPENAI_FUNCTIONS
)
agent = create_sql_agent(llm=LLM, toolkit=toolkit, agent_type=AgentType.OPENAI_FUNCTIONS)

return Tool.from_function(
func=agent.run,