Split string into pieces of fixed length and computing q-grams #471
I guess I can get pretty close with:

```r
library(stringi)

str_split_length <- function(x, n = 1) {
  max_length <- max(stri_length(x))
  idx <- seq(1, max_length, by = n)
  stri_sub_all(x, cbind(idx, length = n))
}

x <- c("ab", "def", "g")
str_split_length(x, 1)
str_split_length(x, 2)
```

I'd just need to clean up the trailing empty strings.
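A minimal sketch of that cleanup, assuming the pieces that fall past the end of a shorter string come back as empty strings (which is how `stri_sub_all()` handles out-of-range positions). The name `str_split_length_trimmed` is hypothetical and not part of stringi:

```r
library(stringi)

# Hypothetical helper: fixed-length split that drops the empty
# padding pieces produced for strings shorter than the longest one.
str_split_length_trimmed <- function(x, n = 1) {
  max_length <- max(stri_length(x))
  idx <- seq(1, max_length, by = n)
  pieces <- stri_sub_all(x, cbind(idx, length = n))
  # nzchar() is FALSE only for "", so real pieces (and NA) are kept.
  lapply(pieces, function(p) p[nzchar(p)])
}

str_split_length_trimmed(c("ab", "def", "g"), 2)
```

For `c("ab", "def", "g")` with `n = 2`, this should give `"ab"`, then `"de" "f"`, then `"g"`. One side effect of this approach is that an empty input string yields `character(0)`.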
Related idea (not yet implemented): #31. But yeah, the question is: why would anyone need it? Computing q-grams, maybe?
Hmmm, maybe that's a better framing? It's like I was thinking of it mostly as a complement to
There's also this:
which extracts grapheme clusters.
Yeah, I think I'll start with that, then paste together into n-grams. Not super efficient, but easy to implement, and then we can see if it's actually useful.
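A rough sketch of that plan: split into grapheme clusters, then paste each window of `q` adjacent clusters back together. It assumes the clusters are obtained with stringi's `stri_split_boundaries(x, type = "character")` (the exact call referenced in the thread is not shown here), and the name `qgrams` is hypothetical:

```r
library(stringi)

# Hypothetical helper: overlapping q-grams built from grapheme clusters.
qgrams <- function(x, q = 2) {
  clusters <- stri_split_boundaries(x, type = "character")
  lapply(clusters, function(g) {
    n <- length(g)
    # Strings with fewer than q clusters have no q-grams.
    if (n < q) return(character(0))
    starts <- seq_len(n - q + 1)
    vapply(
      starts,
      function(i) paste(g[i:(i + q - 1)], collapse = ""),
      character(1)
    )
  })
}

qgrams(c("abcd", "xy"), 2)
```

For `"abcd"` this should produce `"ab" "bc" "cd"`, and for `"xy"` just `"xy"`. As noted above, pasting windows one by one is not efficient, but it is enough to try the idea out.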
I might implement both here (the overlapping and non-overlapping splits), but not today :)
@gagolews Re: why would anyone need it? It's a pretty niche use, but I actually needed exactly this the other day. I had some data collected from two sources that included MAC addresses as strings. However, one source included colon separators and the other did not.

There's a host of solutions for the length-1 vector case in this StackOverflow question. I ended up with:

```r
# Split a single string into consecutive chunks of n characters.
str_chunk <- function(x, n) {
  starts <- seq(1, nchar(x), by = n)
  # substring() clips ends past the string, so a short final chunk is kept.
  substring(x, starts, starts + n - 1)
}

# Insert a colon after every two characters of each MAC address.
fix_mac <- function(x) {
  sapply(lapply(x, str_chunk, 2), paste, collapse = ":")
}
fix_mac(c("0123456789AB", "0123456789AB"))
#> [1] "01:23:45:67:89:AB" "01:23:45:67:89:AB"
```
Yep, good point. Plus, I guess it'd be nice to have an option for handling chunks of different lengths (e.g., first 2 code points, then 3, then 1, etc.)
Yeah, that would be useful! E.g., a similar "reconstruction" case with UUIDs would be chunks of 8, 4, 4, 4, and 12.
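To make the variable-width idea concrete, here is a minimal base R sketch. Both `str_chunk_widths` and `fix_uuid` are hypothetical names used only for illustration, and the sketch assumes each input has at least `sum(widths)` characters:

```r
# Hypothetical helper: split each string into consecutive chunks
# whose lengths are given by `widths`.
str_chunk_widths <- function(x, widths) {
  ends <- cumsum(widths)
  starts <- ends - widths + 1
  lapply(x, function(s) substring(s, starts, ends))
}

# Rebuild hyphenated UUIDs from 32-character strings (8-4-4-4-12).
fix_uuid <- function(x) {
  chunks <- str_chunk_widths(x, c(8, 4, 4, 4, 12))
  vapply(chunks, paste, character(1), collapse = "-")
}

fix_uuid("123e4567e89b12d3a456426614174000")
#> [1] "123e4567-e89b-12d3-a456-426614174000"
```

Computing the start and end positions once with `cumsum()` keeps the per-string work to a single vectorised `substring()` call.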
Is there any existing function that does this?
I feel like this has to be a simple application of an existing function, but I can't figure it out.