Search algorithms #307

Bodigrim · 2020-10-18T17:47:35Z

I've been thinking about splitOn and replace for ByteString. They can be expressed in terms of breakSubstring, but the more I look at the latter the more doubts I get about its implementation.

bytestring/Data/ByteString.hs

Lines 1596 to 1613 in bd5412c

    
           karpRabin src 
        
               | length src < lp = (src,empty) 
        
               | otherwise = search (rollingHash $ unsafeTake lp src) lp 
        
             where 
        
               k           = 2891336453 :: Word32 
        
               rollingHash = foldl' (\h b -> h * k + fromIntegral b) 0 
        
               hp          = rollingHash pat 
        
               m           = k ^ lp 
        
               get = fromIntegral . unsafeIndex src 
        
               search !hs !i 
        
                   | hp == hs && pat == unsafeTake lp b = u 
        
                   | length src <= i                    = (src,empty) -- not found 
        
                   | otherwise                          = search hs' (i + 1) 
        
                 where 
        
                   u@(_, b) = unsafeSplitAt (i - lp) src 
        
                   hs' = hs * k + 
        
                         get i - 
        
                         m * get (i - lp)

What's the reason for Karp-Rabin here? It is great to search for multiple patterns at once, but this is not our case. I suspect that for non-pathological cases even a naive loop with memcmp could very well be faster. And for pathological inputs Karp-Rabin is O(mn) anyways. If we want to fix the worst case scenario, we should employ Knuth-Moris-Pratt or Boyer-Moore.

The text was updated successfully, but these errors were encountered:

kozross · 2021-03-13T02:05:32Z

For what it's worth, I've been doing matching perf shoot-outs between text and text-ascii (which uses bytestring under the hood). Originally, text-ascii used NSN, but this turned out to be severely non-competitive. However, having rewritten it based on bytestring's Word8 finding implementation (basically a heuristic optimization of the naive algorithm, suggested by Crochemore et al), I got much better results, even at -O0:

All
  Matching
    Match, Text:         OK (0.15s)
      5.8 ms ± 434 μs,   0 B  allocated,   0 B  copied
    Match, AsciiText:    OK (0.77s)
      402 μs ±  17 μs, 619 KB allocated, 199 B  copied
    Match 1, Text:       OK (0.39s)
       14 ms ± 1.2 ms,   0 B  allocated,   0 B  copied
    Match 1, AsciiText:  OK (0.72s)
      3.1 ms ±  64 μs,   0 B  allocated,   0 B  copied
    No match, Text:      OK (0.15s)
      5.1 ms ± 381 μs,   0 B  allocated,   0 B  copied
    No match, AsciiText: OK (0.23s)
      3.9 ms ± 238 μs, 9.5 MB allocated, 1.7 KB copied

At -O2, it's not even close:

All
  Matching
    Match, Text:         OK (1.46s)
      105 ms ± 3.1 ms, 754 MB allocated, 154 KB copied
    Match, AsciiText:    OK (0.77s)
      403 μs ±  26 μs, 619 KB allocated, 199 B  copied
    Match 1, Text:       OK (0.19s)
       29 ms ± 1.4 ms, 105 MB allocated,  19 KB copied
    Match 1, AsciiText:  OK (0.28s)
      5.1 ms ± 330 μs,   0 B  allocated,   0 B  copied
    No match, Text:      OK (0.60s)
       92 ms ± 2.0 ms, 675 MB allocated, 116 KB copied
    No match, AsciiText: OK (0.44s)
      3.7 ms ±  84 μs, 9.5 MB allocated, 1.7 KB copied

The allocations (in my implementation) are still something I'm ironing out. For reference, this is working over a 6ish megabyte text; the 'Match' case tries a non-singleton needle with 96 occurrences, the 'Match 1' case tries a singleton needle, and 'No match' tries a non-matching needle. The implementation (of the index-finding function for matching) is below - BS is ByteString, P is Prelude:

indices :: ByteString -> ByteString -> [Int]
indices needle haystack = case BS.uncons needle of 
  Nothing -> []
  Just (h, t) -> case BS.elemIndices h haystack of
    [] -> []
    whole@(i : is) -> if BS.null t 
                then whole
                else go i is
  where
    go :: Int -> [Int] -> [Int]
    go i is = let fragment = BS.take needleLen . BS.drop i $ haystack in
      if fragment == needle
      then i : case P.dropWhile (\j -> j - i < needleLen) is of 
        [] -> []
        (j : js) -> go j js 
      else case is of 
        [] -> []
        (j : js) -> go j js
    needleLen :: Int
    needleLen = BS.length needle

I would argue that this is an excellent choice.

Bodigrim · 2021-03-13T23:04:15Z

@kozross am I reading your benchmarks right? Does text become 20x slower when compiled with -O2?

kozross · 2021-03-14T00:18:31Z

@Bodigrim - you are. It gets better - it's about a factor of 4 worse than even this on GHC 9 (these results are GHC 8.6).

Bodigrim · 2021-03-14T00:22:00Z

@kozross Has it been reported to text issue tracker?

kozross · 2021-03-14T00:26:55Z

@Bodigrim Not presently. I'm trying to get my own house in order, since I'm not in charge of (or involved in) text, but text-ascii is mine and mine alone. I shared this because I figured you might wanna re-use what I end up with, since it'll work for bytestring with zero changes.

Bodigrim · 2021-03-14T00:41:07Z

We are kick-starting a new maintainers team for text, so it may be worth discussing there.

I just run text benchmarks (haskell/text#315) - and there is indeed a huge regression between GHC 8.6 and GHC 9.0 for many functions, like 2-3 times slower. This is simply terrible on its own, saying nothing about -O2 slow down.

I shared this because I figured you might wanna re-use what I end up with, since it'll work for bytestring with zero changes.

Sure, it is very nice of you, thanks for sharing.

kozross · 2021-03-14T00:42:49Z

@Bodigrim Once I've figured out my weird alloc-related issue, I'll definitely post a repro on text's issue tracker, with more specifics.

Bodigrim added the discussion/rfc label Oct 18, 2020

Bodigrim mentioned this issue Jan 3, 2021

Iron out API discrepancies #289

Open

kozross mentioned this issue Mar 15, 2021

Documentation of Data.Text.replace haskell/text#303

Open

sjakobi mentioned this issue Mar 17, 2021

Remove INLINE Pragma on indices haskell/text#219

Draft

subterfugue mentioned this issue May 20, 2023

Lazy variants of isInfixOf and breakSubstring #30

Open

Bodigrim mentioned this issue Jul 9, 2023

Fix isInfixOf and breakSubString in Word16 haskell/filepath#199

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Search algorithms #307

Search algorithms #307

Bodigrim commented Oct 18, 2020 •

edited

Loading

kozross commented Mar 13, 2021 •

edited

Loading

Bodigrim commented Mar 13, 2021

kozross commented Mar 14, 2021 •

edited

Loading

Bodigrim commented Mar 14, 2021

kozross commented Mar 14, 2021 •

edited

Loading

Bodigrim commented Mar 14, 2021 •

edited

Loading

kozross commented Mar 14, 2021

Search algorithms #307

Search algorithms #307

Comments

Bodigrim commented Oct 18, 2020 • edited Loading

kozross commented Mar 13, 2021 • edited Loading

Bodigrim commented Mar 13, 2021

kozross commented Mar 14, 2021 • edited Loading

Bodigrim commented Mar 14, 2021

kozross commented Mar 14, 2021 • edited Loading

Bodigrim commented Mar 14, 2021 • edited Loading

kozross commented Mar 14, 2021

Bodigrim commented Oct 18, 2020 •

edited

Loading

kozross commented Mar 13, 2021 •

edited

Loading

kozross commented Mar 14, 2021 •

edited

Loading

kozross commented Mar 14, 2021 •

edited

Loading

Bodigrim commented Mar 14, 2021 •

edited

Loading