Adding edges() and iteredges() Functions for DAWGs #1
…tionDawgs; adding tests for all
…acing dev data for those
…d for all new edges methods
These latest additions add edges() and iteredges() functionality for all applicable DAWGs and clean up the code since the original pull request. They complete all the work I planned to implement that we originally discussed. Would love to hear your thoughts @kmike.
         if prefix:
             index = self.dct.follow_bytes(prefix, index)
             if not index:
-                return res
+                return
this is backwards incompatible - .items should return an empty list, not None here
Good call. Will fix (and add a test for future).
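To make the compatibility point concrete, here is a minimal sketch using a hypothetical `ToyDawg` class (an assumption for illustration, not the `dawg_python` code under review). It shows a list-returning method handing back its empty result list when the prefix is absent, plus the kind of regression test that would pin this behaviour down.

```python
class ToyDawg(object):
    """Hypothetical stand-in for a DAWG with an items()-style method."""

    def __init__(self, pairs):
        self._pairs = sorted(pairs)

    def items(self, prefix=""):
        res = []
        matches = [(k, v) for k, v in self._pairs if k.startswith(prefix)]
        if not matches:
            # Return the empty list; a bare `return` would yield None
            # and break callers that iterate over the result.
            return res
        res.extend(matches)
        return res


def test_items_missing_prefix_returns_empty_list():
    d = ToyDawg([("foo", 1), ("foobar", 2)])
    assert d.items("xyz") == []
    assert d.items("foo") == [("foo", 1), ("foobar", 2)]


test_items_missing_prefix_returns_empty_list()
```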
@@ -95,6 +95,77 @@ def size(self):
        return len(self._units)


class EdgeFollower(object):
👍 for separating Completer and EdgeFollower
…iate comments to doc strings
… always be used-- not utf-8
            yield item

    def edges(self, prefix=""):
I think that the .edges method should return the same kind of data regardless of DAWG class. If it returns a list of strings in a base class, it should return a list of strings in all subclasses.
For BytesDAWG it could make sense to filter out edges leading to the values.
It's similar data for all. It never returns a list of strings. It always returns a list of 2-tuples. For dawgs with no data, the tuples are (str, True) for terminal edges and (str, False) for non-terminals. For dawgs with data, they're (str, data) for terminal edges, and (str, False) for non-terminals. Since data evaluates to true in a boolean situation, this seems most logical to me. If you want the data in an edge, you have it. If you want to just use the edges and know whether they're terminals or not, you can do that the same way across dawgs.
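The convention described above can be sketched with plain lists, without the library. The edge values below are invented for illustration; the only point is that a caller can test the second tuple element for truthiness in both cases.

```python
# Hypothetical edge lists in the shape described above (values are invented).
edges_no_data = [("fob", False), ("foo", True)]
edges_with_data = [("fob", False), ("foo", b"payload")]


def terminal_keys(edges):
    """Return the keys whose second tuple element is truthy (terminal edges)."""
    return [key for key, value in edges if value]


print(terminal_keys(edges_no_data))
print(terminal_keys(edges_with_data))
```

One caveat of relying on truthiness: an empty payload such as `b""` is falsy in Python, so under this convention it would look the same as a non-terminal edge.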
If we really want them to be the same, we could make them return (str, True) for terminal edges always, and just add an extra edges_with_data() method for dawgs that provide any kind of data storage. That actually seems most consistent to me. If you agree, I'll make that addition.
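A rough sketch of the split proposed here, using a hypothetical dict-backed `ToyTrie` (an assumption for illustration; the real DAWG classes work on a compiled byte-level structure). `edges()` always yields `(str, bool)`, while `edges_with_data()` yields the stored value for terminal edges.

```python
class ToyTrie(object):
    """Hypothetical dict-backed trie; not the DAWG implementation from this PR."""

    def __init__(self, mapping):
        self._data = dict(mapping)

    def _next_keys(self, prefix):
        # All distinct prefixes exactly one character longer than `prefix`.
        n = len(prefix) + 1
        return sorted({k[:n] for k in self._data
                       if k.startswith(prefix) and len(k) >= n})

    def edges(self, prefix=""):
        # Same shape for every class: (str, bool).
        return [(k, k in self._data) for k in self._next_keys(prefix)]

    def edges_with_data(self, prefix=""):
        # Only meaningful for classes that store values:
        # (str, data) for terminal edges, (str, False) otherwise.
        return [(k, self._data.get(k, False)) for k in self._next_keys(prefix)]


trie = ToyTrie({"foo": 1, "fobs": 2})
print(trie.edges("fo"))            # [('fob', False), ('foo', True)]
print(trie.edges_with_data("fo"))  # [('fob', False), ('foo', 1)]
```

With this shape, code that only cares about terminal status never has to know whether the underlying class stores data.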
…a and iteredges_data for appropriate dawgs; adding tests for new methods
@kmike latest change makes … Everything looks done as far as I can tell. I'd like to start using this in prod soon. I can use my fork, but if you plan to merge soon, that would be even better.
Hi @EliFinkelshteyn,
It seems the main complexity is that some characters are represented by multiple transitions, right? You solved it by trying to decode the data until it succeeds, which is reasonable for UTF-8.

Regarding the API: so .edges is like .keys, but it only traverses the graph to a depth of one Unicode character, and also returns whether the result is terminal or not? I think that is reasonable. One question is whether it should return full keys or partial keys, without the prefix. You've implemented it the same way as Completer, which looks fine.

Could you please add more tests? For example, based on https://coveralls.io/builds/2376072/source?filename=dawg_python%2Fwrapper.py, the code which handles UnicodeDecodeErrors is untested; some conditions are also missing in dawgs.py (see https://coveralls.io/builds/2376072/source?filename=dawg_python%2Fdawgs.py).

Thanks for your PR! It is well-written 👍 But I need a bit more time to review it. I'm not sure I'll be able to finish the review during this work week; the weekend is more likely.
So, there's actually an issue here. When unicode chars share the same first bytes, this will only return one of the chars. I am working on fixing that now. I realized you can tell exactly how many bytes are in a unicode char by how many leading ones the first byte has, so I can use this to speed up the whole thing a bit as well.
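The leading-ones observation can be sketched as a small helper. This is an illustration of the UTF-8 rule itself, not the code in this PR; the function name `utf8_char_length` is made up here.

```python
def utf8_char_length(first_byte):
    """Return the byte length of a UTF-8 character given its first byte.

    `first_byte` is an int in range(256). Only the leading-bit pattern is
    inspected; overlong or out-of-range encodings are not detected.
    Raises ValueError for a continuation byte (10xxxxxx) or any other
    pattern that cannot start a character.
    """
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: 2 bytes
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: 3 bytes
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: 4 bytes
        return 4
    raise ValueError("not a UTF-8 leading byte: %r" % (first_byte,))


for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = bytearray(ch.encode("utf-8"))
    assert utf8_char_length(encoded[0]) == len(encoded)
```

Knowing the length up front means a traversal can follow exactly that many transitions instead of attempting a decode after every byte.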
A good catch. For some reason I thought that UTF8 synchronization is enough to make repeated decoding work, but it is not.
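To illustrate the failure mode, here is a sketch over a hypothetical nested-dict byte trie (`build_byte_trie` and `next_chars` are invented names, and this is one possible approach rather than the fix that went into the PR). Following a single path until decoding succeeds would find only one of two characters that share a first byte; branching over every child at each step finds both.

```python
def build_byte_trie(words):
    """Build a nested-dict trie keyed by single byte values (ints)."""
    root = {}
    for word in words:
        node = root
        for b in bytearray(word.encode("utf-8")):
            node = node.setdefault(b, {})
    return root


def next_chars(node, collected=b"", max_len=4):
    """Return every complete character reachable from `node`.

    Tries each child byte in turn; if the bytes collected so far do not
    decode yet, it recurses into that child instead of giving up on the
    siblings. `max_len` bounds the search at UTF-8's 4-byte maximum.
    """
    out = []
    for b in sorted(node):
        candidate = collected + bytes(bytearray([b]))
        try:
            out.append(candidate.decode("utf-8"))
        except UnicodeDecodeError:
            if len(candidate) < max_len:
                out.extend(next_chars(node[b], candidate, max_len))
    return out


# "\u00e9" (C3 A9) and "\u00e8" (C3 A8) share the first byte 0xC3.
trie = build_byte_trie(["\u00e9", "\u00e8", "a"])
assert next_chars(trie) == ["a", "\u00e8", "\u00e9"]
```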
That was my bad. I didn't know Python 3.2 has an issue with 'u'. I'm also just tired, so this took way longer than it should have. It should all be fixed and working now, though.
Ping here. Anything else you want done for this to be merged in?
As discussed at pytries/marisa-trie#20, this adds support for the edges() and iteredges() methods for CompletionDAWG. If this looks good, I'll add similar support for RecordDAWGs and BytesDAWGs. The code isn't as optimized as it could be, but it works, it's clean (IMO), and it's fast enough for me.