Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Score directories based on the keywords searched #264

Closed
wants to merge 7 commits into from

Conversation

lefth
Copy link

@lefth lefth commented Sep 17, 2021

Use word boundary detection and give scores based on keyword matching

I'm using a helper library to implement a unicode algorithm, but I'm also
detecting case changes within a word (from lower to upper case, or no case
to some case, so the "o" in "Documents" doesn't count as a new word).

Words are not searched--rather, the string is searched starting at a word
boundary. That way multi-word sequences will correctly match.

This is a basic solution to #260. There's lots of room for improving the algorithm,
but IMO this is a big UX improvement already.

Some things to consider:

  • We don't have options to control the case. If smart-case is disabled,
    the weights in compute_kw_score need to change.
  • Right now keyword score totally overpowers the frequency score. The
    frequency score is only a tie-breaker. They could be normalized and
    weighted so a much better frequency score would win despite a slightly
    worse keyword score.
  • Should we detect word endings for exact matches? I'm not sure it would
    give a good user experience. If I frequently access "src9" but not "src",
    I don't want "src" to win just because it more exactly matches what I
    typed. It's hard to refrain from typing a whole word. This gives an
    interesting wrong result with "c" being an ideal match for
    "/mnt/c/anything".
  • The above issue can be mitigated but not solved if we consider digits to start a new word.
  • I'm testing these changes with a command like cargo r -- query --list --score b.

@lefth
Copy link
Author

lefth commented Sep 25, 2021

@ajeetdsouza Do you know if some people have really massive directory databases? I optimized this more than might strictly be necessary if it's unlikely anyone has more than a couple hundred directories.

Also related to that: if you don't like the extra complexity of adding a new iterator class and delaying sorting, I can do away with that by sorting in the constructor as before, either by sorting again when the parameters and excludes change, or by removing methods like with_keywords and passing that data to the constructor instead.

@ajeetdsouza
Copy link
Owner

ajeetdsouza commented Sep 26, 2021

I optimized this more than might strictly be necessary

Optimization would be great, but we should focus on finalizing the algorithm first. I haven't had time to look at this PR yet, but hopefully I'll be able to check it out soon.


On a related note, I tried searching for how similar tools handle a z foo command with directories /foo and /foobar, since that seems to be one of the primary motives of this change.

autojump
The solution given was to manually lower the score of the /foobar directory, or re-run the z command since the current directory is excluded (wting/autojump#533).

z
The algorithm was changed to always select the shortest match if the prefix is common among all results. The author comments that the solution may have been suboptimal (rupa/z#35).
Someone else suggested going to an exact match if there is a query like z foo/, which sounds doable (rupa/z#303).


Since we're already analyzing the querying algorithm, here's another suggestion from #263: for a user named foo, z foo bar should probably prefer /home/foo/dir/foo/bar rather than /home/foo/bar. It's unlikely that the user would actually mention their username when searching for a path.

@ajeetdsouza ajeetdsouza force-pushed the main branch 3 times, most recently from 8325b84 to 9176634 Compare September 28, 2021 22:36
@ajeetdsouza ajeetdsouza mentioned this pull request Oct 17, 2021
@ajeetdsouza ajeetdsouza force-pushed the main branch 12 times, most recently from 9899728 to ba7b445 Compare November 1, 2021 16:41
@lefth
Copy link
Author

lefth commented Nov 26, 2021

@ajeetdsouza What would it take for you to accept this pull request? Simpler code, smaller changes?

I suggest this change is a key step toward improving the algorithm--improvement can take a faster pace after the search string influences the score, but after that step a lot of tweaking is possible.

This prevents keywords or options from being added after sorting is done.
So ajeetdsouza#260 can be implemented more safely.
Keyword-based scoring is currently a noop. Directory filtering is done
before scoring, except for a mutating filter that's complex to execute
earlier.

This is a step toward implemnting ajeetdsouza#260.
I'm using a helper library to implement a unicode algorithm, but I'm also
detecting case changes within a word (from lower to upper case, or no case
to some case, so the "o" in "Documents" doesn't count as a new word).

Words are not searched--rather, the string is searched starting at a word
boundary. That way multi-word sequences will correctly match.

This is a basic solution to ajeetdsouza#260.

Some things to consider:
- We don't have options to control the case. If smart-case is disabled,
  the weights in compute_kw_score need to change.
- Right now keyword score totally overpowers the frequency score. The
  frequency score is only a tie-breaker. They could be normalized and
  weighted so a much better frequency score would win despite a slightly
  worse keyword score.
- Should we detect word endings for exact matches? I'm not sure it would
  give a good user experience. If I frequently access "src9" but not "src",
  I don't want "src" to win just because it more exactly matches what I
  typed. It's hard to refrain from typing a whole word. This gives an
  interesting wrong result with "c" being a perfect match for
  "/mnt/c/anything".
- The above issue can be solved if we consider digits to start a new word.
- I'm testing these changes with `cargo r -- query --list --score b`.
So when searching 'd', ~/documents matches better than ~/my-documents.
A future option should be to turn off smart case, but that's not a priority,
for the reasons mentioned here: ajeetdsouza#224
@lefth
Copy link
Author

lefth commented Nov 27, 2021

BTW, this will obviously still need tweaking after it's released. There's no way to make it perfect at the first go without a lot of beta testers. To tweak, I'm thinking we could add to add this to the help string:

If a query takes you to a directory it shouldn't (if the smart logic for picking a directory isn't smart enough),
please comment on with this query and the first few lines of output, so we can see directory scores
and judge if the matcher needs tweaking: zoxide --version && zoxide query -ls <the query>

@ajeetdsouza
Copy link
Owner

ajeetdsouza commented Nov 28, 2021

Hey @lefth, sorry for the delay, I've been having a very busy month! I took a look at the changes, I put in my thoughts below.


I don't think there's any need to change the DirDisplayScore implementation. The purpose of including score in the output is so that users can implement custom functions in their shell configurations -- essentially, zoxide query -ls | sort should sort results by score. Thus, we should ideally change the score function to return the frecency combined with the match score.


This gives an interesting wrong result with c being an ideal match for /mnt/c/anything.

I think that requiring the last component of the path to match the last component of the query works really well, and should not be removed. If you really want to match the highest ranked subdirectory under foo, you can always use a query like z foo /.


Part of this discussion was inspired by the fact that there is no way in zoxide (or in fact, any other autojumper) to cd into /foo if there was a directory called /foobar that was more highly ranked. Are we tackling this problem in any way?

The one way I can think of is to improve the interactive search feature to the point that it becomes a seamless part of one's workflow. This is why I tried to prioritize #257 -- I wanted a user to simply press <SPACE><TAB> and get to interactive selection. This is now available on bash and fish, and I'm bringing it to zsh once I get the time.


I've mentioned this elsewhere, but one of the things I really like about zoxide is the predictability of the algorithm -- even when it's wrong, it's usually not hard to understand why and adapt the query accordingly. The more complex the algorithm becomes, the harder it becomes for people to understand why zoxide picked the directory it did, and the more frustrating it gets when zoxide jumps to the wrong directory.

Ideally, I'd think a good query algorithm would work the same way the current algorithm does, but would assign some weight (in descending order) to-

  • exact matches of components (split by /; example: component is foo and query is foo)
  • left-side matches of components (example: component is foobar and query is foo)
  • exact matches of words (split by unicode word separators)
  • left-hand matches of words
  • regular matches (the way they are now)

Still more ambitious is smartcase and unicode normalization. After this, we could sum up the weights and find the one with the highest score. I'm not sure how this would be implemented, but what do you think of the general idea?

@ajeetdsouza ajeetdsouza force-pushed the main branch 2 times, most recently from d365bfd to 58430d8 Compare December 5, 2021 09:21
@ajeetdsouza ajeetdsouza force-pushed the main branch 7 times, most recently from b2f049d to fd088b4 Compare December 24, 2021 10:50
@ajeetdsouza ajeetdsouza force-pushed the main branch 2 times, most recently from de16c49 to 1736571 Compare March 9, 2022 14:29
@ajeetdsouza ajeetdsouza force-pushed the main branch 3 times, most recently from dc7b300 to 3620189 Compare April 25, 2022 14:26
@ajeetdsouza ajeetdsouza force-pushed the main branch 3 times, most recently from 79229c4 to 3df60eb Compare June 25, 2022 19:52
@ajeetdsouza
Copy link
Owner

Closing due to inactivity, feel free to open an issue to discuss this further.

@ajeetdsouza ajeetdsouza closed this Sep 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants