Feature/cohere embedding #2
base: main
Conversation
Looks good. I have added a couple of comments with questions to better understand the process. I'm probably not the best person to approve this, but I'm happy to do so if you want to get this merged.
added_docs += len(batch)
print("added batch", i + 1)
break
except Exception:
Out of interest, is there one particular type of error that causes this process to fail?
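For context, the excerpt above suggests a batched upload with a retry loop around each batch. A minimal, self-contained sketch of that pattern follows; the `add_documents` method, `batch_size`, and `max_retries` names are assumptions for illustration, not taken from the PR:

```python
import time

def add_documents_in_batches(vectorstore, documents, batch_size=100, max_retries=3):
    """Upload documents in batches, retrying a failed batch a few times.

    ``vectorstore`` is assumed to expose ``add_documents(batch)``; logging the
    exception type helps answer which error class actually causes failures.
    """
    added_docs = 0
    for i in range(0, len(documents), batch_size):
        batch = documents[i : i + batch_size]
        for attempt in range(max_retries):
            try:
                vectorstore.add_documents(batch)
                added_docs += len(batch)
                print("added batch", i // batch_size + 1)
                break
            except Exception as exc:
                # Report the concrete exception type so transient errors
                # (timeouts, rate limits) can be told apart from real bugs.
                print(f"batch failed ({type(exc).__name__}), retrying...")
                time.sleep(0.5 * (attempt + 1))
    return added_docs
```

Logging `type(exc).__name__` (rather than swallowing the bare `Exception`) would make it easy to see whether one particular error dominates.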
def delete_duplicate_urls_from_store(vectorstore):
    """Looks for duplicate source urls in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""

def delete_duplicate_chunks_from_store(vectorstore):
    """Looks for duplicate source urls and text chunks in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""
Am I right in thinking the process follows these steps:
- scrape website
- upload new scrape to vectorstore
- check for matching chunks in vector store
- delete old chunks if a new chunk has been found.
Are the errors you encounter when uploading documents to the vector store the main reason you don't simply delete all entries from the old scrape of the same website before uploading the new scrape?
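The "keep only the most recent copy" step described in the docstrings can be sketched as a pure function over retrieved hits. The document shape below (`_id`, `_source.metadata.url`, `_source.metadata.time_scraped`, `_source.text`) is an assumption about the OpenSearch index, not taken from the PR:

```python
from collections import defaultdict

def find_stale_chunk_ids(docs):
    """Return the ids of all but the most recently scraped copy of each
    (url, text) pair, so the caller can delete the stale duplicates.

    Each ``doc`` is assumed to look like an OpenSearch hit:
    {"_id": ..., "_source": {"metadata": {"url": ..., "time_scraped": ...},
                             "text": ...}}
    """
    groups = defaultdict(list)
    for doc in docs:
        src = doc["_source"]
        groups[(src["metadata"]["url"], src["text"])].append(doc)

    stale = []
    for copies in groups.values():
        # Sort newest first; everything after the first entry is a duplicate.
        copies.sort(key=lambda d: d["_source"]["metadata"]["time_scraped"],
                    reverse=True)
        stale.extend(d["_id"] for d in copies[1:])
    return stale
```

Separating "find stale ids" from "delete by id" keeps the dedup logic testable without a live OpenSearch cluster.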
# remove markdown index links on all the content
document.page_content = remove_markdown_index_links(document.page_content)

split_long_document_list = text_splitter.split_documents(list_of_too_long_docs)
I think you may not need to separate the long/short documents. If you set the chunk size param in RecursiveCharacterTextSplitter relative to max_tokens, then all the shorter docs will just be skipped. Although this would require you to apply remove_markdown_index_links to all documents.
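The suggestion rests on the fact that a splitter with a suitable chunk size returns short documents unchanged, so no long/short pre-sorting is needed. A simplified stand-in illustrating that behaviour (character-based splitting rather than the real RecursiveCharacterTextSplitter, which splits on separators; `chunk_size` here is assumed to be derived from the embedder's max_tokens):

```python
def split_all(texts, chunk_size):
    """Split every text at once: texts at or under ``chunk_size`` pass
    through unchanged, longer ones are broken into chunk_size pieces.

    This mimics the property the comment relies on: feeding *all*
    documents to one splitter leaves the short ones intact.
    """
    out = []
    for text in texts:
        if len(text) <= chunk_size:
            out.append(text)  # short docs are returned as-is
        else:
            out.extend(text[i : i + chunk_size]
                       for i in range(0, len(text), chunk_size))
    return out
```

With this shape, `remove_markdown_index_links` would indeed need to run over every document before splitting, since there is no longer a "short docs" branch that skips preprocessing.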