Can we add new domains into existing LSH indexers? #3

Open
QthCN opened this issue Jan 15, 2021 · 3 comments

Comments

QthCN commented Jan 15, 2021

Hi, I've read the original great paper 👍 and the repo's readme.md. Now I have a question: Can I add new domain records into an existing indexer?

For example if I create an indexer with 1 billion records using

index_eqd, err := lshensemble.BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, 
    len(domainRecords), lshensemble.Recs2Chan(domainRecords))

After creating it, I receive another 1 million records. Can I add them to the existing index_eqd, or do I have to build a new index from all 1 billion + 1 million records?

chrisemezue commented Jan 14, 2022

I am facing a similar issue.
With the MinHash LSH one can query and then add more hashes like this:

from datasketch import MinHashLSH

# m1, m2, m3 are MinHash objects created with num_perm=128
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m2", m2)
result = lsh.query(m1)
# Pickle lsh, then unpickle it later
lsh.insert("m3", m3)  # more MinHashes can be added later and then queried
result = lsh.query(m1)

But with MinHash LSH Ensemble, you can only run .index() once as explained in the code.

I have a setup where I want to:

  1. Create an LSHEnsemble with the data I have -> Call it LSH
  2. Query for duplicates with LSH.
  3. Add more MinHash(es) to LSH -- giving me LSHnew
  4. Query with LSHnew.

How can I do this please? @ekzhu

ekzhu (Owner) commented Jan 24, 2022

Hi, I've read the original great paper 👍 and the repo's readme.md. Now I have a question: Can I add new domain records into an existing indexer?

For example if I create an indexer with 1 billion records using

index_eqd, err := lshensemble.BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, 
    len(domainRecords), lshensemble.Recs2Chan(domainRecords))

After creating it, I receive another 1 million records. Can I add them to the existing index_eqd, or do I have to build a new index from all 1 billion + 1 million records?

You will need to create another index for your new records. The created index itself is frozen and can't be updated.

ekzhu (Owner) commented Jan 24, 2022

The code snippet is from the datasketch Python library. For this Go library, there isn't an update option.
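
One way to act on the advice to create another index for new records is to keep every frozen index and fan each query out to all of them, taking the union of the returned keys. The following is only a minimal sketch of that pattern, not code from either library. `FrozenIndex`, `MultiIndex`, and `ToyIndex` are hypothetical names introduced for illustration. `ToyIndex` is a stand-in that uses exact set containment instead of LSH so the example runs without any dependencies. The `query(signature, size)` shape is an assumption loosely modeled on how an ensemble index is queried; a real index may need more arguments (a threshold, a done channel, and so on).

```python
from typing import Dict, Hashable, Iterable, Iterator, List, Protocol, Set


class FrozenIndex(Protocol):
    """Hypothetical interface: an index that can be queried but not updated."""

    def query(self, signature, size: int) -> Iterable[Hashable]: ...


class MultiIndex:
    """Presents several separately built, frozen indexes as one."""

    def __init__(self) -> None:
        self._indexes: List[FrozenIndex] = []

    def add_index(self, index: FrozenIndex) -> None:
        # Each batch of new records is indexed on its own and appended here;
        # the existing indexes are never modified.
        self._indexes.append(index)

    def query(self, signature, size: int) -> Set[Hashable]:
        # Fan the query out to every index and union the candidate keys.
        results: Set[Hashable] = set()
        for index in self._indexes:
            results.update(index.query(signature, size))
        return results


class ToyIndex:
    """Stand-in for a real frozen index (assumption: exact containment, no hashing)."""

    def __init__(self, records: Dict[Hashable, Iterable], threshold: float) -> None:
        self._records = {key: frozenset(values) for key, values in records.items()}
        self._threshold = threshold

    def query(self, signature, size: int) -> Iterator[Hashable]:
        # `size` is unused here; it is kept only to match the assumed interface.
        q = frozenset(signature)
        for key, values in self._records.items():
            if q and len(q & values) / len(q) >= self._threshold:
                yield key


if __name__ == "__main__":
    combined = MultiIndex()
    combined.add_index(ToyIndex({"a": {1, 2, 3}, "b": {7, 8}}, threshold=0.5))
    # Later, a new batch arrives and gets its own index.
    combined.add_index(ToyIndex({"c": {2, 3, 4}}, threshold=0.5))
    query = {2, 3}
    print(sorted(combined.query(query, len(query))))
```

Each separately built index partitions only its own batch, and query cost grows with the number of indexes, so if many small batches accumulate it may be worth periodically rebuilding a single index from all records.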
