Search algorithms #307
Comments
For what it's worth, I've been doing matching perf shoot-outs between
At -O2, it's not even close:
The allocations (in my implementation) are still something I'm ironing out. For reference, this is working over a 6ish megabyte text; the 'Match' case tries a non-singleton needle with 96 occurrences, the 'Match 1' case tries a singleton needle, and 'No match' tries a non-matching needle. The implementation (of the index-finding function for matching) is below:

    indices :: ByteString -> ByteString -> [Int]
    indices needle haystack = case BS.uncons needle of
      Nothing -> []
      Just (h, t) -> case BS.elemIndices h haystack of
        [] -> []
        whole@(i : is) ->
          if BS.null t
            then whole
            else go i is
      where
        go :: Int -> [Int] -> [Int]
        go i is =
          let fragment = BS.take needleLen . BS.drop i $ haystack
           in if fragment == needle
                then i : case P.dropWhile (\j -> j - i < needleLen) is of
                  [] -> []
                  (j : js) -> go j js
                else case is of
                  [] -> []
                  (j : js) -> go j js
        needleLen :: Int
        needleLen = BS.length needle

I would argue that this is an excellent choice.
@kozross am I reading your benchmarks right? Does
@Bodigrim - you are. It gets better - it's about a factor of 4 worse than even this on GHC 9 (these results are GHC 8.6).
@kozross Has it been reported to
@Bodigrim Not presently. I'm trying to get my own house in order, since I'm not in charge of (or involved in)
We are kick-starting a new maintainers team for I just run
Sure, it is very nice of you, thanks for sharing.
@Bodigrim Once I've figured out my weird alloc-related issue, I'll definitely post a repro on
I've been thinking about splitOn and replace for ByteString. They can be expressed in terms of breakSubstring, but the more I look at the latter, the more doubts I get about its implementation.
bytestring/Data/ByteString.hs
Lines 1596 to 1613 in bd5412c
What's the reason for Karp-Rabin here? It is great for searching for multiple patterns at once, but that is not our case. I suspect that for non-pathological cases even a naive loop with memcmp could very well be faster. And for pathological inputs Karp-Rabin is O(mn) anyway. If we want to fix the worst-case scenario, we should employ Knuth-Morris-Pratt or Boyer-Moore.
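To make the two ideas in this post concrete, here is a minimal sketch, not the library's code and not a proposed patch. The names naiveFind, splitOn, replace and example are hypothetical; only BS.breakSubstring and the other Data.ByteString functions used are existing API. The naive search compares a slice of the haystack against the needle at every offset (ByteString equality is, as far as I know, a length check plus a memcmp-style comparison), and splitOn / replace are built directly on breakSubstring. Treating an empty separator as "never matches" is a choice made for this sketch only.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

-- Hypothetical naive search: offset of the first occurrence of the needle.
-- Worst case is O(m * n), but each comparison is a single bulk equality
-- check on a zero-copy slice of the haystack.
naiveFind :: BS.ByteString -> BS.ByteString -> Maybe Int
naiveFind needle haystack = go 0
  where
    n = BS.length needle
    -- Last offset at which the needle could still fit.
    lastStart = BS.length haystack - n
    go i
      | i > lastStart = Nothing
      | BS.take n (BS.drop i haystack) == needle = Just i
      | otherwise = go (i + 1)

-- Hypothetical helper: split on every non-overlapping occurrence of `sep`.
-- Assumption of this sketch: an empty separator never matches.
splitOn :: BS.ByteString -> BS.ByteString -> [BS.ByteString]
splitOn sep = go
  where
    sepLen = BS.length sep
    go s
      | BS.null sep = [s]
      | otherwise =
          -- breakSubstring returns (prefix, rest), where rest starts with
          -- sep, or is empty when sep does not occur in s.
          let (before, rest) = BS.breakSubstring sep s
           in if BS.null rest
                then [before]
                else before : go (BS.drop sepLen rest)

-- Hypothetical helper: replace every non-overlapping occurrence of `old`
-- with `new`, scanning left to right.
replace :: BS.ByteString -> BS.ByteString -> BS.ByteString -> BS.ByteString
replace old new = BS.intercalate new . splitOn old

-- Small usage example: splitting a comma-separated string.
example :: [BS.ByteString]
example = splitOn (BC.pack ",") (BC.pack "a,b,,c")
```

Because splitOn and replace only go through breakSubstring, whichever search algorithm ends up behind breakSubstring would also determine their performance.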