Score directories based on the keywords searched #264
@ajeetdsouza Do you know if some people have really massive directory databases? I optimized this more than might strictly be necessary if it's unlikely anyone has more than a couple hundred directories. Also related to that: if you don't like the extra complexity of adding a new iterator class and delaying sorting, I can do away with that by sorting in the constructor as before, either by sorting again when the parameters and excludes change, or by removing methods like |
Optimization would be great, but we should focus on finalizing the algorithm first. I haven't had time to look at this PR yet, but hopefully I'll be able to check it out soon. On a related note, I tried searching for how similar tools handle a autojump z Since we're already analyzing the querying algorithm, here's another suggestion from #263: for a user named |
@ajeetdsouza What would it take for you to accept this pull request? Simpler code, smaller changes? I'd suggest this change is a key step toward improving the algorithm: once the search string influences the score, improvement can move at a faster pace, and a lot of tweaking becomes possible.
This prevents keywords or options from being added after sorting is done. So ajeetdsouza#260 can be implemented more safely.
Keyword-based scoring is currently a no-op. Directory filtering is done before scoring, except for a mutating filter that's complex to execute earlier. This is a step toward implementing ajeetdsouza#260.
I'm using a helper library to implement a unicode algorithm, but I'm also detecting case changes within a word (from lower to upper case, or no case to some case, so the "o" in "Documents" doesn't count as a new word). Words are not searched--rather, the string is searched starting at a word boundary. That way multi-word sequences will correctly match. This is a basic solution to ajeetdsouza#260.
Some things to consider:
- We don't have options to control the case. If smart-case is disabled, the weights in compute_kw_score need to change.
- Right now keyword score totally overpowers the frequency score. The frequency score is only a tie-breaker. They could be normalized and weighted so a much better frequency score would win despite a slightly worse keyword score.
- Should we detect word endings for exact matches? I'm not sure it would give a good user experience. If I frequently access "src9" but not "src", I don't want "src" to win just because it more exactly matches what I typed. It's hard to refrain from typing a whole word. This gives an interesting wrong result with "c" being a perfect match for "/mnt/c/anything".
- The above issue can be solved if we consider digits to start a new word.
- I'm testing these changes with `cargo r -- query --list --score b`.
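To make the "normalized and weighted" idea concrete, here is a minimal sketch of one way the two scores could be blended. It is not code from this PR: `combined_score`, its parameters, and the example weight of 0.6 are all hypothetical, and the real score ranges in zoxide may differ.

```rust
/// Hypothetical blend of a keyword score and a frequency score.
/// Each score is divided by the largest value seen among the candidates,
/// so both land in 0.0..=1.0 before the weight is applied.
/// `kw_weight` (an assumption, e.g. 0.6) is the share given to the keyword score.
fn combined_score(
    kw_score: f64,
    max_kw_score: f64,
    freq_score: f64,
    max_freq_score: f64,
    kw_weight: f64,
) -> f64 {
    // Guard against dividing by zero when no candidate has a positive score.
    let kw = if max_kw_score > 0.0 { kw_score / max_kw_score } else { 0.0 };
    let freq = if max_freq_score > 0.0 { freq_score / max_freq_score } else { 0.0 };
    kw_weight * kw + (1.0 - kw_weight) * freq
}
```

With a weight of 0.6, a candidate with the best keyword score but a tenth of the top frequency scores 0.64, while a candidate with half the keyword score and the top frequency scores 0.70, so the much more frequent directory wins. When frequencies are equal, the better keyword match still wins.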
…er than other matches.
So when searching 'd', ~/documents matches better than ~/my-documents.
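A rough sketch of how that preference could be expressed, assuming a match at the start of a path component is worth more than a match at a later word boundary. The function name `position_score` and the weights 2 and 1 are made up for illustration; they are not the values used in this PR, and case handling is left out.

```rust
/// Hypothetical position-based keyword score (weights are assumptions):
///   2 = keyword starts a path component (start of string or right after '/'),
///   1 = keyword starts at another word boundary (after a non-alphanumeric char),
///   0 = keyword is absent or only occurs in the middle of a word.
fn position_score(path: &str, keyword: &str) -> u32 {
    let mut best: u32 = 0;
    let mut prev: Option<char> = None;
    for (i, c) in path.char_indices() {
        if path[i..].starts_with(keyword) {
            let score: u32 = match prev {
                None | Some('/') => 2,
                Some(p) if !p.is_alphanumeric() => 1,
                Some(_) => 0,
            };
            best = best.max(score);
        }
        prev = Some(c);
    }
    best
}
```

Under these assumed weights, `position_score("~/documents", "d")` is higher than `position_score("~/my-documents", "d")`, and a path like `~/sandbox` does not score at all for `d`.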
A future option should be to turn off smart case, but that's not a priority, for the reasons mentioned here: ajeetdsouza#224
BTW, this will obviously still need tweaking after it's released. There's no way to make it perfect at the first go without a lot of beta testers. To tweak, I'm thinking we could add this to the help string:
|
Hey @lefth, sorry for the delay, I've been having a very busy month! I took a look at the changes and put my thoughts below. I don't think there's any need to change the
I think that requiring the last component of the path to match the last component of the query works really well, and should not be removed. If you really want to match the highest ranked subdirectory under
Part of this discussion was inspired by the fact that there is no way in zoxide (or in fact, any other autojumper) to cd into
The one way I can think of is to improve the interactive search feature to the point that it becomes a seamless part of one's workflow. This is why I tried to prioritize #257 -- I wanted a user to simply press
I've mentioned this elsewhere, but one of the things I really like about zoxide is the predictability of the algorithm -- even when it's wrong, it's usually not hard to understand why and adapt the query accordingly. The more complex the algorithm becomes, the harder it becomes for people to understand why zoxide picked the directory it did, and the more frustrating it gets when zoxide jumps to the wrong directory.
Ideally, I'd think a good query algorithm would work the same way the current algorithm does, but would assign some weight (in descending order) to-
Still more ambitious is smartcase and unicode normalization. After this, we could sum up the weights and find the one with the highest score. I'm not sure how this would be implemented, but what do you think of the general idea? |
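For reference, the rule that the last part of the query must match the last component of the path can be sketched roughly as below. This is a simplified illustration written for this discussion, not zoxide's actual code: `matches_query` is a hypothetical name, matching is plain lowercase substring search, and `/` is assumed to be the only path separator.

```rust
/// Hypothetical matcher: keywords must appear in order, and the last keyword
/// must match inside the last path component.
fn matches_query(path: &str, keywords: &[&str]) -> bool {
    let path = path.to_lowercase();
    // Only `path[..end]` is searched; the window shrinks as keywords match.
    let mut end = path.len();
    // Walk the keywords from last to first, searching from the right.
    for (n, kw) in keywords.iter().rev().enumerate() {
        let kw = kw.to_lowercase();
        let idx = match path[..end].rfind(kw.as_str()) {
            Some(idx) => idx,
            None => return false,
        };
        // For the last keyword, no separator may follow the match,
        // i.e. the match has to end inside the last component.
        if n == 0 && path[idx + kw.len()..].contains('/') {
            return false;
        }
        end = idx;
    }
    true
}
```

With this sketch, `["proj", "zox"]` matches `/home/me/projects/zoxide`, while `["proj"]` alone does not, because `proj` only occurs in a parent component. That keeps the behaviour easy to predict, which is the property argued for above.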
Closing due to inactivity, feel free to open an issue to discuss this further. |
Use word boundary detection and give scores based on keyword matching
I'm using a helper library to implement a unicode algorithm, but I'm also
detecting case changes within a word (from lower to upper case, or no case
to some case, so the "o" in "Documents" doesn't count as a new word).
Words are not searched--rather, the string is searched starting at a word
boundary. That way multi-word sequences will correctly match.
This is a basic solution to #260. There's lots of room for improving the algorithm,
but IMO this is a big UX improvement already.
Some things to consider:
- We don't have options to control the case. If smart-case is disabled,
  the weights in compute_kw_score need to change.
- Right now keyword score totally overpowers the frequency score. The
  frequency score is only a tie-breaker. They could be normalized and
  weighted so a much better frequency score would win despite a slightly
  worse keyword score.
- Should we detect word endings for exact matches? I'm not sure it would
  give a good user experience. If I frequently access "src9" but not "src",
  I don't want "src" to win just because it more exactly matches what I
  typed. It's hard to refrain from typing a whole word. This gives an
  interesting wrong result with "c" being an ideal match for
  "/mnt/c/anything".
- I'm testing these changes with `cargo r -- query --list --score b`.
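The word boundary idea described above can be illustrated with a small standard-library-only sketch. It is an approximation for discussion, not the PR's implementation: the PR relies on a helper library for the Unicode algorithm, whereas this version only looks at alphanumeric and case properties of adjacent characters. The names `word_starts`, `starts_with_smart_case` and `matches_at_word_start` are hypothetical.

```rust
/// Byte offsets where a "word" starts, under this sketch's simplified rules:
/// - an alphanumeric char at the start of the string or after a non-alphanumeric char,
/// - an uppercase char right after a lowercase char ("myDocuments" -> "Documents"),
/// - a cased char right after an uncased alphanumeric char (e.g. a letter after a digit).
fn word_starts(s: &str) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut prev: Option<char> = None;
    for (i, c) in s.char_indices() {
        let boundary = match prev {
            None => c.is_alphanumeric(),
            Some(p) if !p.is_alphanumeric() => c.is_alphanumeric(),
            Some(p) => {
                let p_cased = p.is_lowercase() || p.is_uppercase();
                let c_cased = c.is_lowercase() || c.is_uppercase();
                (p.is_lowercase() && c.is_uppercase()) || (!p_cased && c_cased)
            }
        };
        if boundary {
            starts.push(i);
        }
        prev = Some(c);
    }
    starts
}

/// Smart case: a keyword containing an uppercase char is matched exactly,
/// otherwise the haystack is lowercased char by char before comparing.
fn starts_with_smart_case(haystack: &str, keyword: &str) -> bool {
    if keyword.chars().any(char::is_uppercase) {
        haystack.starts_with(keyword)
    } else {
        let mut hay = haystack.chars().flat_map(char::to_lowercase);
        keyword.chars().all(|k| hay.next() == Some(k))
    }
}

/// The string (not a single word) is searched from each word start,
/// so a keyword may span several words.
fn matches_at_word_start(path: &str, keyword: &str) -> bool {
    word_starts(path)
        .into_iter()
        .any(|i| starts_with_smart_case(&path[i..], keyword))
}
```

For `/home/me/myDocuments`, this treats `home`, `me`, `my` and `Documents` as word starts, so `doc` and the multi-word `mydoc` both match, while `ocu` does not because the "o" in "Documents" is not a boundary.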