Concern with weight calculation using BLAST and entropies #4
Comments
Hi, the formula comes from the BLAST paper (http://www.vldb.org/pvldb/vol9/p1173-simonini.pdf). Maybe you can also try fine-tuning BLAST with the `chi2divider` parameter; increasing it will smooth the pruning.
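Roughly, the idea looks like this (an illustrative sketch only, not the library's actual code): each candidate edge is scored by a chi-square statistic on block co-occurrence scaled by the entropy of the shared attributes, and `chi2divider` relaxes the pruning threshold.

```python
# Illustrative sketch only -- not the library's actual implementation.

def chi_square_2x2(shared, blocks_p1, blocks_p2, total_blocks):
    """Chi-square statistic on the 2x2 block co-occurrence table of two profiles."""
    observed = [
        [shared, blocks_p1 - shared],
        [blocks_p2 - shared, total_blocks - blocks_p1 - blocks_p2 + shared],
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = float(sum(row_sums))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def edge_weight(shared, blocks_p1, blocks_p2, total_blocks, entropy):
    # BLAST-style weight: co-occurrence significance scaled by attribute entropy.
    return chi_square_2x2(shared, blocks_p1, blocks_p2, total_blocks) * entropy

def keep_edge(weight, neighborhood_avg_weight, chi2divider=2.0):
    # Pruning sketch: a larger chi2divider lowers the threshold, so more edges survive.
    return weight >= neighborhood_avg_weight / chi2divider
```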
Thanks for helping out. I understand the logic of high entropy being more meaningful than low entropy, and I do believe that makes sense for real-world entities. For my test dataset, the only matchable field is the customer name, surrounded by a lot of junk fields; I'm just trying to test the worst edge-case scenario. The issue is, I only get one cluster of fields, and an example of matched tokens is the last four digits of a phone number paired with a snippet of a UUID. Is this just a drawback of using this approach? Maybe the edge case is unrealistic for real-world data? I've tried tuning `chi2divider`, but that just seems to increase the returned edge count, and when I select by highest weight I'm left with inaccurate results. My customer names should be a perfect 1-to-1 match between the two files. Maybe the clustering step should completely ignore irrelevant fields if there are far too few matches? What do you think?
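For reference, this is roughly how I'm selecting matches by highest weight (the edge tuples and weights below are made up, not real output from my dataset):

```python
# Rough illustration of the "pick the highest weight" step; data is invented.
def best_match_per_profile(edges):
    """edges: iterable of (left_id, right_id, weight); keep the top-weight match per left_id."""
    best = {}
    for left_id, right_id, weight in edges:
        if left_id not in best or weight > best[left_id][1]:
            best[left_id] = (right_id, weight)
    return best

edges = [
    (0, 101, 4.2),   # correct customer-name match, moderate weight
    (0, 137, 9.8),   # junk overlap (e.g. phone digits vs. a UUID snippet), higher weight
    (1, 102, 5.1),   # correct match
]
print(best_match_per_profile(edges))
# {0: (137, 9.8), 1: (102, 5.1)}  -> profile 0 ends up paired with the wrong record
```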
This library is pretty incredible; I just have a bit of a concern I wanted to report.
My use case is as follows:
Take two CSVs containing customer data with one or more matchable fields (an identifier, for example).
customers1.csv:
customers2.csv:
I would get a few mismatches because the weight of matches for `cluster_id` 2 would be greater than for `cluster_id` 1. Assuming there are 100 rows in each profile, with 100 as the separator ID (200 profiles total), the output edges would look something like:
You will notice that the higher weight goes to the match that has the higher entropy.
This doesn't seem correct to me since lower entropy should give higher weight.
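To illustrate what I mean with made-up numbers (these are not actual output values from my run):

```python
# Made-up numbers, just to show the effect being described.
chi2_name, entropy_name = 8.0, 0.5   # customer-name match; lower entropy in my data
chi2_junk, entropy_junk = 3.0, 4.0   # junk-field token overlap; higher entropy

# Current weighting (chi-square * entropy): the junk edge wins.
print(chi2_name * entropy_name, chi2_junk * entropy_junk)   # 4.0 12.0

# Dividing by entropy instead: the customer-name edge wins.
print(chi2_name / entropy_name, chi2_junk / entropy_junk)   # 16.0 0.75
```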
Using the standard library I was able to get around 80-90 perfect matches. Once I edited the `calc_weights` function in `common_node_pruning.py` from `calc_chi_square(...) * entropies[neighbor_id]` to `calc_chi_square(...) / entropies[neighbor_id]`, I got 100 perfect 1-to-1 matches. Does using division instead of multiplication here make sense, and is my assumption that lower entropy should mean a stronger match correct?
Please let me know :)