Thanks a lot (also for doing this on SIGIR deadline day!). That solves my problem. The optimize flag is costly for a large index though, so the PR may still be helpful. I checked, and the PR gives the exact same number of unique terms for my (non-optimized) Robust04 index: 923436 unique terms, i.e., the Lucene term iterator seems to work correctly on multiple segments.
BTW, I forgot to remove the "terms" declaration from the original getIndexStats() method (should have installed Eclipse directly).
I want to know the number of unique terms in my index, but the reported value is -1.
Steps:
IndexCollection -collection TrecCollection -input /home/hiemstra/Data/robust04/ -index lucene-index.robust04.pos+docvectors -threads 16 -storePositions -storeDocvectors
IndexReaderUtils -stats -index lucene-index.robust04.pos+docvectors/
Results:
Index statistics
----------------
documents: 528030
documents (non-empty): 528030
unique terms: -1
total terms: 174540872
Turns out that: "Terms.size(): (...) may be unavailable (returns -1) for some Terms implementations such as MultiTerms where it cannot be efficiently computed."
I already solved this myself: I will add a pull request.
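To illustrate why a merged multi-segment view cannot report its size cheaply, and what counting by iteration looks like, here is a minimal sketch. It is a toy model in plain Java, not Lucene code and not necessarily what the pull request does: each segment is modeled as a sorted set of strings, and the names `UniqueTermCount`, `segment`, `mergedSize` and `countByIteration` are hypothetical. The only thing taken from the quote above is the convention of returning -1 when the size is unavailable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.SortedSet;
import java.util.TreeSet;

public class UniqueTermCount {

    /** One position in one segment's sorted term list. */
    static final class Cursor {
        String current;
        final Iterator<String> rest;

        Cursor(String current, Iterator<String> rest) {
            this.current = current;
            this.rest = rest;
        }
    }

    /** Hypothetical helper: builds a toy "segment" as a sorted set of terms. */
    static SortedSet<String> segment(String... terms) {
        return new TreeSet<>(Arrays.asList(terms));
    }

    /**
     * Size of the merged view. With one segment the size is known; with several,
     * segments may share terms, so the sum of sizes would overcount. This toy
     * model mirrors the "-1 means unavailable" convention instead of guessing.
     */
    static long mergedSize(List<SortedSet<String>> segments) {
        return segments.size() == 1 ? segments.get(0).size() : -1;
    }

    /** Counts distinct terms with a k-way merge over the sorted segments. */
    static long countByIteration(List<SortedSet<String>> segments) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
        for (SortedSet<String> s : segments) {
            Iterator<String> it = s.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }
        long count = 0;
        String previous = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            // Terms arrive in sorted order, so duplicates are adjacent.
            if (!c.current.equals(previous)) {
                count++;
                previous = c.current;
            }
            if (c.rest.hasNext()) {
                c.current = c.rest.next();
                heap.add(c);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<SortedSet<String>> segments = new ArrayList<>();
        segments.add(segment("index", "lucene", "term"));
        segments.add(segment("lucene", "segment", "term"));

        long size = mergedSize(segments);
        // Fall back to iteration only when the cheap answer is unavailable.
        long unique = size >= 0 ? size : countByIteration(segments);
        System.out.println("size(): " + size + ", unique terms: " + unique);
    }
}
```

The trade-off is the same one discussed in this thread: merging everything into a single segment makes the size directly available but is expensive on a large index, while iterating costs one pass over the terms.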