Add paragraph sequence field #118

Closed
JasonLo opened this issue Jul 25, 2024 · 5 comments

JasonLo commented Jul 25, 2024

To support a downstream feature in the USGS project, we need to add a field in Weaviate to indicate paragraph order within a document. The required changes are:

  1. Redefine the Weaviate schema.
  2. Update the preprocessor to include paragraph_order.
  3. Update the ingest pipeline.
  4. Rebuild Weaviate (if necessary).

Assuming the raw text remains unchanged, we can likely skip re-embedding by reusing the existing embeddings, matched on paragraph_hash. This should be safe and efficient.
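For illustration, the preprocessor change (item 2) might look roughly like the sketch below. It is not the project's actual code: the `Paragraph` dataclass, the blank-line splitter, the 0-based numbering, and the use of SHA-256 for `paragraph_hash` are all assumptions made for the example. The last line shows the new field in the JSON shape Weaviate uses for schema property definitions.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Paragraph:
    docid: str
    text: str
    paragraph_order: int  # position within the document (0-based is an assumption)
    paragraph_hash: str   # hash of the raw text, used to match existing embeddings


def hash_text(text: str) -> str:
    # SHA-256 is an assumption; the real preprocessor's hash function may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_paragraphs(docid: str, raw_text: str) -> list[Paragraph]:
    # Hypothetical splitter: paragraphs are separated by blank lines and
    # numbered in reading order.
    chunks = [c.strip() for c in raw_text.split("\n\n") if c.strip()]
    return [Paragraph(docid, c, i, hash_text(c)) for i, c in enumerate(chunks)]


# New schema property, in Weaviate's property-definition shape.
PARAGRAPH_ORDER_PROPERTY = {"name": "paragraph_order", "dataType": ["int"]}
```

Because the hash depends only on the paragraph text, adding `paragraph_order` does not change `paragraph_hash`, which is what would make reusing the old embeddings possible.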

JasonLo self-assigned this Jul 25, 2024

JasonLo commented Jul 29, 2024

After discussion, the update should proceed as follows:

Steps:

  1. Gather a master list of docids.
  2. Subset the docids to those from Geoarchive and CriticalMASS.
  3. For each docid, call preprocessorv2 (v1 + paragraph ordering).
  4. Compare hashed_text for each paragraph. If unchanged, retrieve the embedding data from the existing Weaviate instance.
  5. If changed, drop all paragraphs with the same docid and reprocess the whole document.
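Steps 4 and 5 can be sketched as a per-document decision. This is only a minimal illustration under stated assumptions: `plan_document_update` and the record layout are hypothetical, the existing Weaviate data is stood in for by a plain dict of `hashed_text` to embedding, and SHA-256 is assumed as the hash.

```python
import hashlib


def hash_text(text: str) -> str:
    # Assumed hash function; the real pipeline may use a different one.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_document_update(new_paragraphs, existing):
    """Decide how to patch one docid.

    new_paragraphs: paragraph texts from the new preprocessor, in order.
    existing: dict mapping hashed_text -> embedding already stored for this docid.
    Returns ("reuse", records) if every hash matches, else ("reprocess", records).
    """
    hashes = [hash_text(t) for t in new_paragraphs]

    if set(hashes) == set(existing):
        # Text unchanged: keep the old embeddings and only attach the order.
        records = [
            {"hashed_text": h, "paragraph_order": i, "embedding": existing[h]}
            for i, h in enumerate(hashes)
        ]
        return "reuse", records

    # Any mismatch: drop the document's paragraphs and re-embed all of them.
    records = [
        {"hashed_text": h, "paragraph_order": i, "embedding": None}
        for i, h in enumerate(hashes)
    ]
    return "reprocess", records
```

Deciding at the document level rather than per paragraph follows step 5: one changed paragraph is enough to reprocess the whole docid, which keeps the stored paragraphs and their order consistent with one another.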


iross commented Jul 29, 2024 via email


JasonLo commented Aug 2, 2024

UW-xDD/text2graph_llm#20


JasonLo commented Aug 2, 2024

@ilmcconnell, @iross

Started the patch for paragraph order, but it's slower than expected due to the lack of a batch update function. Estimated completion time: 4-5 days. Will check progress on 8/7.


JasonLo commented Aug 4, 2024

#119
The patch has been completed ahead of schedule.

JasonLo closed this as completed Aug 4, 2024