vectorize_all for stri_detect_* #404
Comments
Good idea, but this can also be easily implemented with the |
To clarify: I was not suggesting using something like my hacky function, but wondering whether a faster implementation in Rcpp would make sense. I am also not sure what Let me state my argument a bit differently: My use case is text processing of large character vectors for use in ML models. If certain words from large word lists appear in a text column, I can modify a feature column in my data.table. Therefore, using Since I use |
So in other words, you're advocating for a set of functions for:
Let's call them Could you provide some "emulated" examples - like virtual calls on some specific inputs and the desired outputs you'd like to see? |
Sure. I have added examples with word boundaries and
fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")
# Case 1
stri_match_any(str = fruit, patterns = c("\\bbanana\\b", "\\bapple\\b"))
[1] TRUE TRUE FALSE
# Same as:
stri_detect_regex(fruit, "(\\bbanana\\b)|(\\bapple\\b)")
[1] TRUE TRUE FALSE
# Case 2
stri_match_all(str = fruit, patterns = c("\\bpear\\b","\\bapple\\b"))
[1] FALSE TRUE FALSE
# Same as
stri_detect_regex(fruit, "(?=.*\\bpear\\b)(?=.*\\bapple\\b)")
[1] FALSE TRUE FALSE |
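For readers who want to try the two cases above today, here is a minimal sketch that emulates them in plain R on top of `stri_detect_regex`. The helper name `detect_combined` and its `combine` argument are hypothetical and not part of stringi. It makes one pass over the input per pattern, so it only illustrates the requested semantics and is not the fast implementation asked for in this issue.

```r
library(stringi)

# Hypothetical helper (not part of stringi): one stri_detect_regex() call per
# pattern, then a row-wise combination of the results.
detect_combined <- function(str, patterns, combine = c("any", "all")) {
  combine <- match.arg(combine)
  # Column j holds the match result of patterns[j] against every element of str
  hits <- vapply(patterns,
                 function(p) stri_detect_regex(str, p),
                 logical(length(str)))
  # Force a matrix with one row per input string (also for length-1 input)
  hits <- matrix(hits, nrow = length(str))
  if (combine == "any") rowSums(hits) > 0 else rowSums(hits) == ncol(hits)
}

fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")

# Case 1: at least one pattern matches
detect_combined(fruit, c("\\bbanana\\b", "\\bapple\\b"), "any")
# [1]  TRUE  TRUE FALSE

# Case 2: every pattern matches
detect_combined(fruit, c("\\bpear\\b", "\\bapple\\b"), "all")
# [1] FALSE  TRUE FALSE
```

Both calls reproduce the outputs listed in the examples above.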
If this is only about searching for fixed patterns, possibly a trie-like data structure could do the trick, especially if the number of patterns was large. Matching of whole words could be done using ICU's BreakIterator, internally. At first glance, I'm afraid that any other implementation will not be significantly more efficient than running |
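The fixed-word variant described above can be sketched at the R level as follows. This is an illustration under stated assumptions, not the internal implementation being discussed: it swaps the trie for R's hash-based `%in%` lookup, and it relies on `stri_extract_all_words`, which splits text at ICU word boundaries. The helper name `detect_words` is hypothetical.

```r
library(stringi)

# Hypothetical helper (not part of stringi): tokenise each string once, then
# test whether the fixed words occur among its tokens.
detect_words <- function(str, words, combine = c("any", "all")) {
  combine <- match.arg(combine)
  # One character vector of word tokens per input string
  tokens <- stri_extract_all_words(str)
  test <- if (combine == "any") any else all
  # words %in% tok is TRUE for each word found among the tokens of one string
  vapply(tokens, function(tok) test(words %in% tok), logical(1))
}

fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")

detect_words(fruit, c("banana", "apple"), "any")
# [1]  TRUE  TRUE FALSE

detect_words(fruit, c("pear", "apple"), "all")
# [1] FALSE  TRUE FALSE
```

Each string is tokenised only once regardless of how many words are searched, which is the property that makes this shape attractive for large word lists. Whether it beats a single alternation regex in practice would need to be measured.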
I mean, I kind of like the idea of these functions, generally. |
I very much like that option in stri_replace_* and am wondering why stri_detect_* does not have it. I have built a function that does it for me and adds the functionality to combine matches with a logical operator, but it would be great to access this with full C++ speed.