-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
values() bug with unicode characters #20
Comments
* Better handling for unicode * Found pytries/datrie#20
This is likely happens because of lack of validation. You alphabet is 0-255, but the data contain values outside this alphabet. Unfortunately, libdatrie (which is this wrapper using) doesn't really support unicode; it only supports mapping a subset of Unicode (alphabet you're passing) to a single-byte 0-255 range. So, to fix that the following is needed:
I don't have plans to work on this in near future, so contributions are welcome. What datrie version are you using? There was validation improvements in datrie 0.7. |
* Failed to complete encoding tests due to pytries/datrie#20 * Enable these tests @ CircleCI #1368
I've taken a quick look at your code, and it seems you're building a trie at the startup and then just using it. If so, you may give https://github.com/kmike/DAWG a shot - it supports unicode properly, and it is faster and more memory efficient than datrie. The downside is that you can't modify DAWG after building. |
Are you sure? I mean... those special characters are translated to multi-byte somewhere, and those bytes are always 0-255. I had to change to >>> from datrie import Trie
>>> import string
>>> m = Trie(string.printable)
>>> m[u'a'] = 1
>>> m[u'á'] = 1
>>> m[u'é'] = 1
>>> u'ì' in m
True This looks like a similar/related bug.
I believe 3) is the best option. Before sending anything to the library you simply
Using datrie==0.6.1 |
Will try it out, thanks for the link. If nothing works, I might end up working on 3), does that make sense to you? |
After some minor tests: DAWG won't be useful in my case since my values are instances, not bytes. >>> class A():
... pass
...
>>> a = A()
>>> data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', a)]
>>> bytes_dawg = dawg.BytesDAWG(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "dawg.pyx", line 473, in dawg.BytesDAWG.__init__ (src/dawg.cpp:7904)
File "dawg.pyx", line 288, in dawg.CompletionDAWG.__init__ (src/dawg.cpp:5795)
File "dawg.pyx", line 37, in dawg.DAWG.__init__ (src/dawg.cpp:1982)
File "dawg.pyx", line 472, in genexpr (src/dawg.cpp:7725)
TypeError: Expected bytes, got instance
>>> |
Source required to do 3) is in C. Haven't touched that in ages. Won't be able to help. |
Even better than (3) is to talk to libdatrie author and make its "unicode support" optional. Currently, alphabet is a range of unicode chars that datrie support. It doesn't use multibyte characters, it always encode char using a single byte. For other wrappers I usually encode/decode data using utf8, but with libdatrie it means extra work/slowdowns as encoding/decoding will happen twice (in a wrapper and in a library). You can use IntDAWG with indices and store values in a list (this is what datrie doing for you). Another option is https://github.com/kmike/marisa-trie - it is slower, but it is even more memory efficient, and it gives you indices for free. |
Running in the latest version of >>> from datrie import Trie
>>> m = Trie([chr(x) for x in range(0, 255)])
>>> m[u'core/encoding_utf8/\xe9.py'] = 1
>>> m[u'core/encoding_utf8/\u6539.py'] = 2
>>> m.values(u'core/encoding_utf8')
[1]
>>> u'core/encoding_utf8/\u6539.py' in m
False
>>> Note the last "False", which differs from the initially reported "True". This makes the datrie a little bit more consistent, but still a bug IMHO. At least it should raise an exception when trying to add something that afterwards it will tell you that it is not there. |
Oh man... you 💌 wrappers huh? |
Here is one more for your pleasure: https://github.com/kmike/hat-trie :) I was looking for a perfect data structure at one point of my life.. I agree it should raise an exception. That said, it is documented that what you're trying to do won't work (see a warning here https://pypi.python.org/pypi/datrie/). |
Thanks for the hint, finally I ended up using the same idea but with |
Current unicode support is still buggy, #48 |
I would expect this output instead:
Using datrie==0.6.1
The text was updated successfully, but these errors were encountered: