Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search algorithms #307

Open
Bodigrim opened this issue Oct 18, 2020 · 7 comments
Open

Search algorithms #307

Bodigrim opened this issue Oct 18, 2020 · 7 comments

Comments

@Bodigrim
Copy link
Contributor

Bodigrim commented Oct 18, 2020

I've been thinking about splitOn and replace for ByteString. They can be expressed in terms of breakSubstring, but the more I look at the latter the more doubts I get about its implementation.

bytestring/Data/ByteString.hs

Lines 1596 to 1613 in bd5412c

karpRabin src
| length src < lp = (src,empty)
| otherwise = search (rollingHash $ unsafeTake lp src) lp
where
k = 2891336453 :: Word32
rollingHash = foldl' (\h b -> h * k + fromIntegral b) 0
hp = rollingHash pat
m = k ^ lp
get = fromIntegral . unsafeIndex src
search !hs !i
| hp == hs && pat == unsafeTake lp b = u
| length src <= i = (src,empty) -- not found
| otherwise = search hs' (i + 1)
where
u@(_, b) = unsafeSplitAt (i - lp) src
hs' = hs * k +
get i -
m * get (i - lp)

What's the reason for Karp-Rabin here? It is great to search for multiple patterns at once, but this is not our case. I suspect that for non-pathological cases even a naive loop with memcmp could very well be faster. And for pathological inputs Karp-Rabin is O(mn) anyways. If we want to fix the worst case scenario, we should employ Knuth-Moris-Pratt or Boyer-Moore.

@kozross
Copy link
Contributor

kozross commented Mar 13, 2021

For what it's worth, I've been doing matching perf shoot-outs between text and text-ascii (which uses bytestring under the hood). Originally, text-ascii used NSN, but this turned out to be severely non-competitive. However, having rewritten it based on bytestring's Word8 finding implementation (basically a heuristic optimization of the naive algorithm, suggested by Crochemore et al), I got much better results, even at -O0:

All
  Matching
    Match, Text:         OK (0.15s)
      5.8 ms ± 434 μs,   0 B  allocated,   0 B  copied
    Match, AsciiText:    OK (0.77s)
      402 μs ±  17 μs, 619 KB allocated, 199 B  copied
    Match 1, Text:       OK (0.39s)
       14 ms ± 1.2 ms,   0 B  allocated,   0 B  copied
    Match 1, AsciiText:  OK (0.72s)
      3.1 ms ±  64 μs,   0 B  allocated,   0 B  copied
    No match, Text:      OK (0.15s)
      5.1 ms ± 381 μs,   0 B  allocated,   0 B  copied
    No match, AsciiText: OK (0.23s)
      3.9 ms ± 238 μs, 9.5 MB allocated, 1.7 KB copied

At -O2, it's not even close:

All
  Matching
    Match, Text:         OK (1.46s)
      105 ms ± 3.1 ms, 754 MB allocated, 154 KB copied
    Match, AsciiText:    OK (0.77s)
      403 μs ±  26 μs, 619 KB allocated, 199 B  copied
    Match 1, Text:       OK (0.19s)
       29 ms ± 1.4 ms, 105 MB allocated,  19 KB copied
    Match 1, AsciiText:  OK (0.28s)
      5.1 ms ± 330 μs,   0 B  allocated,   0 B  copied
    No match, Text:      OK (0.60s)
       92 ms ± 2.0 ms, 675 MB allocated, 116 KB copied
    No match, AsciiText: OK (0.44s)
      3.7 ms ±  84 μs, 9.5 MB allocated, 1.7 KB copied

The allocations (in my implementation) are still something I'm ironing out. For reference, this is working over a 6ish megabyte text; the 'Match' case tries a non-singleton needle with 96 occurrences, the 'Match 1' case tries a singleton needle, and 'No match' tries a non-matching needle. The implementation (of the index-finding function for matching) is below - BS is ByteString, P is Prelude:

indices :: ByteString -> ByteString -> [Int]
indices needle haystack = case BS.uncons needle of 
  Nothing -> []
  Just (h, t) -> case BS.elemIndices h haystack of
    [] -> []
    whole@(i : is) -> if BS.null t 
                then whole
                else go i is
  where
    go :: Int -> [Int] -> [Int]
    go i is = let fragment = BS.take needleLen . BS.drop i $ haystack in
      if fragment == needle
      then i : case P.dropWhile (\j -> j - i < needleLen) is of 
        [] -> []
        (j : js) -> go j js 
      else case is of 
        [] -> []
        (j : js) -> go j js
    needleLen :: Int
    needleLen = BS.length needle

I would argue that this is an excellent choice.

@Bodigrim
Copy link
Contributor Author

@kozross am I reading your benchmarks right? Does text become 20x slower when compiled with -O2?

@kozross
Copy link
Contributor

kozross commented Mar 14, 2021

@Bodigrim - you are. It gets better - it's about a factor of 4 worse than even this on GHC 9 (these results are GHC 8.6).

@Bodigrim
Copy link
Contributor Author

@kozross Has it been reported to text issue tracker?

@kozross
Copy link
Contributor

kozross commented Mar 14, 2021

@Bodigrim Not presently. I'm trying to get my own house in order, since I'm not in charge of (or involved in) text, but text-ascii is mine and mine alone. I shared this because I figured you might wanna re-use what I end up with, since it'll work for bytestring with zero changes.

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Mar 14, 2021

We are kick-starting a new maintainers team for text, so it may be worth discussing there.

I just run text benchmarks (haskell/text#315) - and there is indeed a huge regression between GHC 8.6 and GHC 9.0 for many functions, like 2-3 times slower. This is simply terrible on its own, saying nothing about -O2 slow down.

I shared this because I figured you might wanna re-use what I end up with, since it'll work for bytestring with zero changes.

Sure, it is very nice of you, thanks for sharing.

@kozross
Copy link
Contributor

kozross commented Mar 14, 2021

@Bodigrim Once I've figured out my weird alloc-related issue, I'll definitely post a repro on text's issue tracker, with more specifics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants