Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible to walk the trie? #20

Closed
EliFinkelshteyn opened this issue Mar 13, 2015 · 6 comments
Closed

Possible to walk the trie? #20

EliFinkelshteyn opened this issue Mar 13, 2015 · 6 comments

Comments

@EliFinkelshteyn
Copy link

I don't see any methods that would allow for just proceeding one edge in the trie, for example:

trie = marisa_trie.Trie([u'key1', u'key2', u'kite'])
trie.edges('k') #would return [u'ke', u'ki']

Using the prefixes method for something like this would be very expensive if the trie is big and the prefix is short. Is there some technical detail I'm missing for why implementing a function for this would be costly, or some other reason this isn't implemented? It seems like it's a necessary step in the traversal with the prefixes() method anyway, and quite useful for predictive lookup operations.

@kmike
Copy link
Member

kmike commented Mar 13, 2015

It is a marisa-trie C++ library limitation - see https://code.google.com/p/marisa-trie/issues/detail?id=9.

.prefixes() for Trie is implemented using a dedicated function marisa-trie provides, but .prefixes() of RecordTrie is suboptimal because of this limitation.

What is your use case? It is possible to implement .edges() function (or add a support for Iterator objects) in https://github.com/kmike/DAWG and https://github.com/kmike/DAWG-Python; these packages often provide compression rates similar to marisa-trie.

@EliFinkelshteyn
Copy link
Author

Use case is anything like fuzzy predictive search, where at each edge forward, a fuzzy comparison is made using some fuzzy distance metrics coupled with score to see if it's worth moving forward past that edge. With the current implementation, I'd need to instead proactively check to see if edges exist (so instead of trie.edges() giving ['ke','ki'], I'd be forced to check all possible next letters following "k" for existence, which gets more and more expensive with bigger alphabets).

By the way, I hope this doesn't come off as criticism. I think all your trie implementations are awesome, and looked through DAWG and dat-trie for similar functionality last night. I'll try to add this functionality myself if necessary, but wanted to check if I'm missing something first.

On Mar 13, 2015, at 4:12 AM, Mikhail Korobov [email protected] wrote:

It is a marisa-trie C++ library limitation - see https://code.google.com/p/marisa-trie/issues/detail?id=9.

.prefixes() for Trie is implemented using a dedicated function marisa-trie provides, but .prefixes() of RecordTrie is suboptimal because of this limitation.

What is your use case? It is possible to implement .edges() function (or add a support for Iterator objects) in https://github.com/kmike/DAWG and https://github.com/kmike/DAWG-Python; these packages often provide compression rates similar to marisa-trie.


Reply to this email directly or view it on GitHub.

@kmike
Copy link
Member

kmike commented Mar 13, 2015

Thanks!

datrie has State and Iterator objects which allow to implement any iteration in user code. It has some issues though - memory savings are much smaller than in DAWG/DAFSA or marisa-trie, and building of datrie can be slow for large tries (see pytries/datrie#12). I don't use datrie myself, and the development stalled because for my use cases one of DAWG, marisa-trie or hat-trie usually work better than datrie. There are also some open issues in tracker.

DAWG doesn't have a public API for custom iteration, but this API can be implemented. Check https://github.com/kmike/DAWG-Python/blob/master/dawg_python/wrapper.py and how are classes from there used in https://github.com/kmike/DAWG-Python/blob/master/dawg_python/dawgs.py (DAWG package has a very similar coe, but in Cython). Pull requests are welcome :)

Unfortunately we can't create a similar API for marisa-trie because of C++ library limitations.

I haven't added a public API for iteration to DAWG because of speed issues. Lookups are going to be much slower if iteration is done in Python instead of Cython, that's why there are many optimized methods for common use cases instead of a couple of generic iterator helpers.

@kmike
Copy link
Member

kmike commented Mar 13, 2015

If you have suggestions for new optimized methods for DAWG (e.g. for fuzzy match) then I'm open to PRs; it could be a good addition to the library.

@EliFinkelshteyn
Copy link
Author

Alright, I'll investigate a bit more, and if it still looks useful, I'll send over a PR. Closing the issue out here, since it'd be much easier to do for DAWG than the Marisa-Trie.

@jstypka
Copy link

jstypka commented Nov 27, 2015

Actually, this is the second time that this feature would be very useful to me, so I thought I might mention it here. I also looked through other Python trie libraries (PyTrie, patricia-trie, python-trie) and neither of them support this.

My usecase (which I would expect to be not that rare) is walking down the tree while computing Levenshtein distance, I use it for fuzzy matches similar to @EliFinkelshteyn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants