
Debug BM25Okapi #26

Open
wants to merge 1 commit into base: master

Conversation


@LowinLi LowinLi commented Aug 4, 2022

In the `_calc_idf` method of `BM25Okapi`, if `average_idf` is negative, `eps` is negative as well, so the BM25 score can also come out negative. This commit fixes that.
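To make the failure mode concrete, here is a minimal sketch of an epsilon-floored IDF step. It is a simplified stand-in written for illustration, not the library's actual code: the function name `idf_with_epsilon_floor`, the `epsilon=0.25` default, and the toy document frequencies are all assumptions.

```python
import math

def idf_with_epsilon_floor(doc_freqs, corpus_size, epsilon=0.25):
    """Hypothetical, simplified IDF step: negative IDFs are replaced by eps."""
    idf = {}
    negative_terms = []
    idf_sum = 0.0
    for term, freq in doc_freqs.items():
        # Classic BM25 IDF: log((N - n + 0.5) / (n + 0.5))
        value = math.log(corpus_size - freq + 0.5) - math.log(freq + 0.5)
        idf[term] = value
        idf_sum += value
        if value < 0:
            negative_terms.append(term)
    average_idf = idf_sum / len(idf)
    # eps inherits the sign of average_idf, which is the problem described above
    eps = epsilon * average_idf
    for term in negative_terms:
        idf[term] = eps
    return idf, average_idf, eps

# Toy corpus of 10 documents in which every term is in more than half of them
doc_freqs = {"the": 9, "of": 8, "cat": 7}
idf, average_idf, eps = idf_with_epsilon_floor(doc_freqs, corpus_size=10)
print(average_idf, eps, idf)
```

With these inputs every raw IDF is negative, so `average_idf` and `eps` are negative and the "floor" still leaves all IDFs below zero.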

@dorianbrown dorianbrown self-requested a review May 28, 2024 12:19
@dorianbrown dorianbrown (Owner) commented May 28, 2024

I think I finally found where this motivation came from, namely this section from here:


Please note that the IDF formula listed above has a drawback when using it for terms appearing in more than half of the corpus since the value would come out as negative value, resulting in the overall score to become negative. e.g. if we have 10 documents in the corpus, and the term "the" appeared in 6 of them, its IDF would be $\log\frac{10 - 6 + 0.5}{6 + 0.5} = \log\frac{4.5}{6.5}$.

Although we can argue that our implementation should have already removed these frequently appearing words, as they are mostly used to form a complete sentence and carry little meaning of note, different software packages still make different adjustments to prevent a negative score from ever occurring. e.g.

  • Add a 1 to the equation. $\text{IDF}(q_i) = \log\left(1 + \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\right)$
  • For terms that result in a negative IDF value, swap it with a small positive value, usually denoted as epsilon
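The arithmetic in the quoted example and both adjustments can be checked with a few lines of Python. This is only a sketch: the variable names and the `epsilon = 0.1` value are assumptions chosen for illustration, and natural logarithms are assumed.

```python
import math

N = 10      # documents in the corpus
n_the = 6   # documents containing the term "the"

ratio = (N - n_the + 0.5) / (n_the + 0.5)   # 4.5 / 6.5

# Classic IDF: negative because the term is in more than half of the corpus
classic = math.log(ratio)                    # about -0.37

# Adjustment 1: add 1 inside the logarithm
plus_one = math.log(1 + ratio)

# Adjustment 2: swap a negative IDF for a small positive epsilon (assumed value)
epsilon = 0.1
floored = classic if classic >= 0 else epsilon

print(classic, plus_one, floored)
```

Note that the second adjustment only helps if epsilon itself is guaranteed to be positive.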

@dorianbrown (Owner)

I wonder if it might be simpler to just go with the "smoothed" IDF function $\text{IDF}(q_i) = \log\left(1 + \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\right)$, which ensures that IDFs are always positive. That way we don't have to do all this checking for negative values.

What do you think?
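As a rough sketch of what the smoothed variant would look like (the helper name `smoothed_idf` and its parameters are hypothetical, not part of the current code): since the fraction inside is always positive, the argument of the logarithm is always greater than 1, so the result stays above zero even for a term that appears in every document.

```python
import math

def smoothed_idf(corpus_size, doc_freq):
    """Hypothetical helper: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (corpus_size - doc_freq + 0.5) / (doc_freq + 0.5))

# Check every possible document frequency for a 10-document corpus
corpus_size = 10
values = [smoothed_idf(corpus_size, n) for n in range(corpus_size + 1)]
for n, value in enumerate(values):
    print(n, round(value, 4))
```

The values shrink as the document frequency grows, so common terms are still down-weighted, just never below zero.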
