Split string into pieces of fixed length and computing q-grams #471

hadley · 2022-01-20T19:48:00Z

Is there any existing function that does this?

x <- c("ab", "def", "g")
split_length(x, 1)
#> list(c("a", "b"), c("d", "e", "f"), "g")

split_length(x, 2)
#> list("ab", c("de", "f"), "g")

I feel like this has to be a simple application of an existing function, but I can't figure it out.

hadley · 2022-01-20T19:54:32Z

I guess I can get pretty close with stri_sub_all():

library(stringi)

str_split_length <- function(x, n = 1) {
  max_length <- max(stri_length(x))
  idx <- seq(1, max_length, by = n)
  
  stri_sub_all(x, cbind(idx, length = n))
}

x <- c("ab", "def", "g")
str_split_length(x, 1)
str_split_length(x, 2)

I'd just need to clean up the trailing "".

gagolews · 2022-01-21T00:01:28Z

Related idea (not yet implemented): #31

But yeah, the question is why would anyone need it? Computing q-grams maybe?

hadley · 2022-01-21T02:01:16Z

Hmmm, maybe that's a better framing? It's like str_split_ngram() where you provide the boundary (character, word, ...), q/n, and whether or not you want overlaps? Then it becomes a tool that could underlie (e.g.) https://juliasilge.github.io/tidytext/reference/unnest_ngrams.html

I was thinking of it mostly as a complement to tidyr::separate(df, x, by = c(1, 5, 10)), which splits a string up into a fixed number of pieces that go into columns. What's the parallel if you don't know how many pieces there are, and hence want the result to end up in rows? My current motivation is filling a hole in a 2d matrix of functions; I'm not sure if this arises frequently in practice.

gagolews · 2022-01-21T02:41:11Z

There's also this:

stringi::stri_split_boundaries(c("ab", "def", "g"), type="character")
[[1]]
[1] "a" "b"

[[2]]
[1] "d" "e" "f"

[[3]]
[1] "g"

which extracts grapheme clusters

hadley · 2022-01-21T03:15:02Z

Yeah, I think I'll start with that, then paste together into n-grams. Not super efficient but easy to implement and then we can see if it's actually useful.
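A rough sketch of that "split into characters, then paste together" idea, in base R only. The name str_ngrams() and its overlap argument are made up for illustration, and strsplit(x, "") is a stand-in that splits on code points; swapping in stri_split_boundaries(x, type = "character") should give grapheme clusters instead.

```r
# Hypothetical helper: split each string into character n-grams.
# overlap = TRUE  -> sliding q-grams of exactly n characters
# overlap = FALSE -> consecutive pieces of length n (last one may be shorter)
str_ngrams <- function(x, n = 2, overlap = FALSE) {
  # Assumption: code points are good enough here; use
  # stringi::stri_split_boundaries(x, type = "character") for graphemes
  chars <- strsplit(x, "")
  lapply(chars, function(ch) {
    len <- length(ch)
    if (len == 0) return(character(0))
    if (overlap) {
      # A string shorter than n has no complete q-gram
      if (len < n) return(character(0))
      starts <- seq_len(len - n + 1)
    } else {
      starts <- seq(1, len, by = n)
    }
    # Paste the characters of each window back together
    vapply(starts, function(s) {
      paste(ch[s:min(s + n - 1, len)], collapse = "")
    }, character(1))
  })
}

x <- c("ab", "def", "g")
str_ngrams(x, 2)
#> list("ab", c("de", "f"), "g")

str_ngrams(x, 2, overlap = TRUE)
#> list("ab", c("de", "ef"), character(0))
```

As noted, this loops in R over every string and every window, so it is more of a way to try out the interface than something to use on a lot of data.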

gagolews · 2022-01-21T03:38:06Z

I might implement both here (the overlapping and non-overlapping splits), but not today :)

mikmart · 2022-02-05T12:46:09Z

@gagolews Re: why would anyone need it? It's a pretty niche use, but I actually needed exactly this the other day.

I had some data collected from two sources that included MAC addresses as strings. However, one source included colon separators ("01:23:45:67:89:AB") and another had removed them ("0123456789AB"). I wanted to harmonize the format to include the separators, which then meant I needed to split the strings from the latter source into chunks of length 2.

There's a host of solutions for the length 1 vector case in this StackOverflow question. I ended up lapply()ing one of them (see below), but a purpose-built function would have been a great help (there was a decent amount of data, so looping in R was slow).

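str_chunk <- function(x, n) {
  # x is a single string; start a new chunk every n characters
  starts <- seq(1, max(nchar(x), 1), by = n)
  # substring() clips ends past the string, so a short final chunk is kept
  substring(x, starts, starts + n - 1)
}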

fix_mac <- function(x) {
  sapply(lapply(x, str_chunk, 2), paste, collapse = ":")
}

fix_mac(c("0123456789AB", "0123456789AB"))
#> [1] "01:23:45:67:89:AB" "01:23:45:67:89:AB"
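For this particular case (re-inserting a separator rather than needing the chunks themselves), a vectorized regex might avoid the R-level loop altogether. This is only a sketch: str_insert_every() is a made-up name, it assumes sep contains no backslashes, and I haven't benchmarked it against the lapply() version.

```r
# Hypothetical helper: insert `sep` after every n characters,
# except after the final chunk.
str_insert_every <- function(x, n, sep = ":") {
  # Match n characters that are followed by at least one more character;
  # the lookahead (?=.) keeps a separator from being added at the end
  pattern <- sprintf("(.{%d})(?=.)", n)
  gsub(pattern, paste0("\\1", sep), x, perl = TRUE)
}

str_insert_every(c("0123456789AB", "0123456789AB"), 2)
#> [1] "01:23:45:67:89:AB" "01:23:45:67:89:AB"
```

gsub() is vectorized over x, so the whole column can be passed in one call.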

gagolews · 2022-02-06T01:53:49Z

Yep, good point. Plus, I guess it'd be nice to have an option for handling chunks of different lengths (e.g., first 2 code points, then 3, then 1, etc.)

mikmart · 2022-02-06T11:19:03Z

Yeah that would be useful! e.g. a similar "reconstruction" case with UUIDs would be chunks of 8, 4, 4, 4, and 12.
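To make the UUID case concrete, here is a minimal base R sketch of chunking by a vector of widths. str_chunk_widths() and fix_uuid() are hypothetical names, not anything that exists in stringi, and the example UUID is just a placeholder value.

```r
# Hypothetical helper: cut each string into consecutive chunks
# whose lengths are given by `widths`.
str_chunk_widths <- function(x, widths) {
  # Convert widths into start/end positions, e.g. 8, 4 -> 1-8, 9-12
  ends <- cumsum(widths)
  starts <- ends - widths + 1
  lapply(x, substring, first = starts, last = ends)
}

# Rebuild the usual 8-4-4-4-12 layout from a UUID stripped of hyphens
fix_uuid <- function(x) {
  chunks <- str_chunk_widths(x, c(8, 4, 4, 4, 12))
  vapply(chunks, paste, character(1), collapse = "-")
}

fix_uuid("123e4567e89b12d3a456426614174000")
#> [1] "123e4567-e89b-12d3-a456-426614174000"
```

This still loops over the strings in R, so it has the same speed problem as the MAC address version; it only shows what the interface could look like.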

gagolews changed the title ~~Split string into pieces of fixed length?~~ Split string into pieces of fixed length and computing q-grams Jan 21, 2022
