
[DAT-70] feat: Use forward and reverse tokenization #2624

Merged 1 commit on Nov 26, 2024

Commits on Nov 26, 2024

  1. feat: Use forward and reverse tokenization

    This enables the "reverse" tokenization mode, which allows searching
    tokens in both the forward and backward directions.
    The forward mode matches from left to right (prefixes), while the reverse
    mode matches from right to left (suffixes). For example, with the word
    "example", you can search "exam" in forward mode and "ample" in reverse.
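
The prefix/suffix behaviour described above can be sketched as follows. This is only an illustration of the idea, not the search library's actual implementation: `forward_tokens`, `reverse_tokens`, and `build_index` are hypothetical helper names, and the sketch assumes that enabling the reverse mode keeps the forward tokens and adds the suffix tokens.

```python
def forward_tokens(word: str) -> list[str]:
    """Prefixes of `word`, shortest first (left-to-right matching)."""
    return [word[:i] for i in range(1, len(word) + 1)]


def reverse_tokens(word: str) -> list[str]:
    """Suffixes of `word`, longest first (right-to-left matching)."""
    return [word[i:] for i in range(len(word))]


def build_index(words, tokenizers):
    """Map each generated token to the set of words that produced it."""
    index: dict[str, set[str]] = {}
    for word in words:
        for tokenize in tokenizers:
            for token in tokenize(word):
                index.setdefault(token, set()).add(word)
    return index


# Hypothetical one-word corpus, indexed with and without suffix tokens.
forward_only = build_index(["example"], [forward_tokens])
with_reverse = build_index(["example"], [forward_tokens, reverse_tokens])

print("exam" in forward_only)   # True: "exam" is a prefix
print("ample" in forward_only)  # False: suffixes are not indexed
print("ample" in with_reverse)  # True: "ample" is a suffix
print("amp" in with_reverse)    # False: mid-word matches need the "full" mode
```

The last lookup shows the limit of this mode: a fragment from the middle of a word is neither a prefix nor a suffix, so it is not found.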
    
    We measured a 15-20% memory increase when enabling reverse tokenization,
    compared to the forward mode alone.
    The "full" mode, which allows searching on all combinations, including
    the middle of a word, comes with a ~70% memory increase. So we decided
    not to enable it for now, as the cost/benefit ratio of such a feature
    is unclear.
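
As a rough intuition for why the modes differ in cost, the sketch below counts the distinct tokens each mode would generate for a single word. It assumes that forward means prefixes, reverse means prefixes plus suffixes, and full means every substring; the helper names are hypothetical. Token counts do not translate directly into memory, so this does not reproduce the figures measured above.

```python
def prefixes(word: str) -> set[str]:
    """All non-empty prefixes of `word`."""
    return {word[:i] for i in range(1, len(word) + 1)}


def suffixes(word: str) -> set[str]:
    """All non-empty suffixes of `word`."""
    return {word[i:] for i in range(len(word))}


def substrings(word: str) -> set[str]:
    """All non-empty contiguous substrings of `word`."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
    }


word = "example"
counts = {
    "forward": len(prefixes(word)),
    "reverse": len(prefixes(word) | suffixes(word)),  # assumed: forward + suffixes
    "full": len(substrings(word)),
}
print(counts)  # {'forward': 7, 'reverse': 12, 'full': 27}
```

The number of prefixes and suffixes grows linearly with word length, while the number of substrings grows quadratically, which fits the full mode being the most expensive of the three.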
    
    The memory cost of enabling the reverse mode, however, seems reasonable.
    paultranvan committed Nov 26, 2024
    393af6d