Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text: add fun isSubsequenceOf #369

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions src/Data/Text.hs
Original file line number Diff line number Diff line change
Expand Up @@ -167,6 +167,7 @@ module Data.Text
, isPrefixOf
, isSuffixOf
, isInfixOf
, isSubsequenceOf

-- ** View patterns
, stripPrefix
Expand Down Expand Up @@ -253,6 +254,7 @@ import qualified Language.Haskell.TH.Lib as TH
import qualified Language.Haskell.TH.Syntax as TH
import Text.Printf (PrintfArg, formatArg, formatString)
import System.Posix.Types (CSsize(..))
import Data.Function (on)

-- $setup
-- >>> import Data.Text
Expand Down Expand Up @@ -1881,6 +1883,43 @@ isInfixOf needle haystack
| otherwise = not . L.null . indices needle $ haystack
{-# INLINE [1] isInfixOf #-}

-- 2021-09-29: NOTE:
-- * after the implementation - determine & mention the big O
-- | The 'isSubsequenceOf' function takes the main text and the subsequnce
-- to find and returns 'True' iff the second argument is a subsequence
-- of the first.
--
-- "Subsequence" used in the meaning of: characters of the second argument
-- appear in same sequential order in the main data, to say second argument can
-- be derived by deleting some (any) or no elements from the first.
--
-- Examples:
--
-- >>> isSubsequenceOf "1234567" "1356"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't the arguments her backwards? Generally isInfixOf etc. are used infix:

>>> "ello" `isInfixOf` "Hello, world!"
True

but this would be

"1234567" `isSubsequenceOf` "1356"

which should be False, if you read it in English. It should be isSubsequenceOf needles haystack - "1234567" is not a subsequence of "1356".

-- True
--
-- >>> isSubsequenceOf "1234567" "21"
-- False
--
-- `isSubsequenceOf` is the base case & implementation of fuzzy search.
isSubsequenceOf :: Text -> Text -> Bool
isSubsequenceOf tf sf
| length sf > length tf = False
| otherwise = subseqOf tf sf
where
subseqOf :: Text -> Text -> Bool
subseqOf t s =
on f uncons t s
where
f :: Maybe (Char, Text) -> Maybe (Char, Text) -> Bool
f _ Nothing = True
f Nothing _ = False
f (Just (tc,ts)) (Just (sc,ss)) =
subseqOf ts $
if tc == sc
then s
else ss
Comment on lines +1910 to +1921
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason not to just use

isSubsequenceOf needles haystack = List.isSubsequenceOf (T.unpack needles) (T.unpack haystack)

This code is doing exactly the same work, since T.unpack (as far as I understand) is just repeated calls to uncons. I can't see how there's be any performance here (though I could see an argument to using an optimised implementation in C if there's a more efficient algorithm that takes advantage of the UTF-8 encoding).


-------------------------------------------------------------------------------
-- * View patterns

Expand Down
43 changes: 43 additions & 0 deletions tests/Tests/Properties/Substrings.hs
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,8 @@ import qualified Data.Text.Internal.Lazy as TL (Text(..))
import qualified Data.Text.Internal.Lazy.Fusion as SL
import qualified Data.Text.Lazy as TL
import qualified Tests.SlowFunctions as Slow
import Control.Monad (replicateM)
import Data.List (nub, sort)

s_take n = L.take n `eqP` (unpackS . S.take n)
s_take_s (Small n) = L.take n `eqP` (unpackS . S.unstream . S.take n)
Expand Down Expand Up @@ -231,6 +233,47 @@ tl_isSuffixOf s = L.isSuffixOf s`eqP` TL.isSuffixOf (packS s)
t_isInfixOf s = L.isInfixOf s `eqP` T.isInfixOf (packS s)
tl_isInfixOf s = L.isInfixOf s `eqP` TL.isInfixOf (packS s)

-- | Generator for substrings that keeps the element order.
-- Aka: "1234567890" -> "245680"
genOrdSubseq :: T.Text -> Gen T.Text
genOrdSubseq txt =
T.pack . transform <$> genTransformMap
Comment on lines +239 to +240
Copy link

@ghost ghost Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't understand what this implementation is actually doing, but is this not all we need?

Suggested change
genOrdSubseq txt =
T.pack . transform <$> genTransformMap
genOrdSubseq txt =
T.pack . map fst . filter snd . zip (T.unpack txt) <$> vectorOf (T.length txt) arbitrary :: Gen [Bool]

We're just trying to select a random sample of elements from the given string in order right? Are there any other properties we need from this? Trying to ensure that the result is non-empty isn't too hard, you just generate two lists of Bools:

Suggested change
genOrdSubseq txt =
T.pack . transform <$> genTransformMap
genOrdSubseq txt = do
firstLen <- choose (1,T.length txt - 2)
prefix <- vectorOf firstLen arbitrary
suffix <- vectorOf (T.length txt - firstLen - 1) arbitrary
pure $ T.pack . map fst . filter snd . zip (T.unpack txt) $ prefix ++ True : suffix

(not sure if I have the maths right, but I'm trying to make a list of length txt which has a True somewhere in the middle)

where

pickN :: Gen Int
pickN =
choose (0, T.length txt)

pickNs :: Gen [Int]
pickNs =
fmap (sort . nub) $ (`replicateM` pickN) =<< pickN

growInst :: [Bool] -> Int -> [Bool]
growInst ls n =
ls
<> take (length ls - pred n) [True ..]
<> [False]

mkTransformInst :: [Bool] -> [Int] -> [Bool]
mkTransformInst bls [] =
bls
<> take (T.length txt - length bls) [True ..]
mkTransformInst bls (i:is) =
mkTransformInst
(growInst bls i)
is

mkTransformMap :: [a] -> [Int] -> [(a, Bool)]
mkTransformMap ls ixs =
zip ls (mkTransformInst mempty ixs)

genTransformMap :: (Gen [(Char, Bool)])
genTransformMap = fmap (mkTransformMap $ T.unpack txt) pickNs

transform :: [(Char, Bool)] -> [Char]
transform =
foldr (\ (c, b) as -> as <> if b then [c] else mempty) mempty

t_stripPrefix s = (fmap packS . L.stripPrefix s) `eqP` T.stripPrefix (packS s)
tl_stripPrefix s = (fmap packS . L.stripPrefix s) `eqP` TL.stripPrefix (packS s)

Expand Down