Feature/cohere embedding #2

Open · AndreasThinks wants to merge 5 commits into main

Conversation

AndreasThinks (Contributor)

No description provided.

@alexmoore-iai (Contributor) left a comment:

Looks good. I've added a couple of comments with questions to better understand the process. I'm probably not the best person to approve this, but I'm happy to do so if you want to get this merged.

added_docs += len(batch)
print("added batch", i + 1)
break
except Exception:

@alexmoore-iai:

Out of interest, is there one particular type of error that causes this process to fail?
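
For context, a minimal sketch of the kind of batched upload loop the excerpt above appears to come from. This is illustrative only: the `add_documents` call (LangChain-style vector store interface), the batch size, the retry count, and the backoff policy are assumptions, not the PR's actual code.

```python
import time

def add_documents_in_batches(vectorstore, documents, batch_size=100, max_retries=3):
    """Upload documents to the vector store in batches, retrying failed batches.

    Sketch only: assumes a LangChain-style `vectorstore.add_documents`; batch
    size, retry count, and backoff are placeholders.
    """
    added_docs = 0
    batches = [documents[j : j + batch_size] for j in range(0, len(documents), batch_size)]
    for i, batch in enumerate(batches):
        for attempt in range(max_retries):
            try:
                vectorstore.add_documents(batch)
                added_docs += len(batch)
                print("added batch", i + 1)
                break  # this batch succeeded; move on to the next
            except Exception:
                # typical culprits: embedding-API rate limits, bulk-indexing timeouts
                time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    return added_docs
```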

def delete_duplicate_urls_from_store(vectorstore):
    """Looks for duplicate source urls in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""

def delete_duplicate_chunks_from_store(vectorstore):
    """Looks for duplicate source urls and text chunks in the Opensearch vectorstore, and removes them, keeping only the most recent based on metadata.time_scraped"""

@alexmoore-iai:

Am I right in thinking the process follows these steps:

  • scrape website
  • upload new scrape to vectorstore
  • check for matching chunks in vector store
  • delete old chunks if a new chunk has been found.

Are the errors you encounter when uploading documents to the vector store the main reason you don't simply delete all entries from the old scrape of the same website before uploading the new scrape?
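
For readers following along, one way the dedup step described in the docstrings above could be implemented against OpenSearch. This is a sketch under assumptions: the `metadata.source` field name, the use of the low-level `opensearchpy` client rather than LangChain's wrapper, and the single unpaginated search are all guesses, not the PR's code.

```python
from collections import defaultdict

from opensearchpy import OpenSearch  # low-level client; the PR may use LangChain's wrapper instead

def delete_duplicate_urls_from_store(client: OpenSearch, index: str) -> None:
    """Keep only the newest document per source URL, judged by metadata.time_scraped.

    Sketch only: field names and the 10,000-hit unpaginated search are
    assumptions; a real implementation would paginate (e.g. search_after).
    """
    resp = client.search(
        index=index,
        body={
            "query": {"match_all": {}},
            "_source": ["metadata.source", "metadata.time_scraped"],
            "size": 10000,
        },
    )

    # Group (time_scraped, doc id) pairs by source URL.
    by_url = defaultdict(list)
    for hit in resp["hits"]["hits"]:
        meta = hit["_source"]["metadata"]
        by_url[meta["source"]].append((meta["time_scraped"], hit["_id"]))

    # Sort newest-first (assumes ISO-8601 timestamps, which sort lexicographically)
    # and delete everything except the most recent entry for each URL.
    for entries in by_url.values():
        entries.sort(reverse=True)
        for _, doc_id in entries[1:]:
            client.delete(index=index, id=doc_id)
```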

# remove markdown index links on all the content
document.page_content = remove_markdown_index_links(document.page_content)

split_long_document_list = text_splitter.split_documents(list_of_too_long_docs)

@alexmoore-iai:

I think you may not need to separate the long/short documents. If you set the chunk size param in RecursiveCharacterTextSplitter relative to max_tokens, then all the shorter docs will just pass through unsplit. Although this would require you to apply remove_markdown_index_links to all documents. A sketch of that simplification follows.
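
A sketch of that suggested simplification, assuming LangChain's `RecursiveCharacterTextSplitter` and that `documents`, `max_tokens`, and `remove_markdown_index_links` are in scope from the PR's code. Note the splitter measures `chunk_size` in characters by default, so the characters-per-token factor below is a rough placeholder, not a verified value.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Clean every document first, then split everything in one pass.
for document in documents:
    document.page_content = remove_markdown_index_links(document.page_content)

# chunk_size is in characters by default, so relate it to max_tokens with a
# rough ~4-chars-per-token heuristic (tune for the Cohere tokenizer).
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4 * max_tokens,
    chunk_overlap=200,
)

# Documents shorter than chunk_size pass through as single chunks, so there is
# no need to pre-sort documents into "long" and "short" lists.
chunks = text_splitter.split_documents(documents)
```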
