feat: Implement separate() #107

etiennebacher · 2022-08-03T12:23:58Z

This PR implements separate() to split a column into several columns, based either on a regex or on character positions.

@nathaneastwood this PR is not complete. I opened it as a draft so that it is saved somewhere and so that you can help with the TODO list if you have some time.

TODO:

- fix the case extra = "merge" (1 test failing so far)
- implement the fill argument (how this argument works is not very clear to me)

Some examples:

suppressPackageStartupMessages(library(poorman))

df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
df
#>      x
#> 1 <NA>
#> 2  x.y
#> 3  x.z
#> 4  y.z
df %>% separate(x, c("A", "B"))
#>      A    B
#> 1 <NA> <NA>
#> 2    x    y
#> 3    x    z
#> 4    y    z

df <- data.frame(x = c(NA, "a1b", "c4d", "e9g"))
df
#>      x
#> 1 <NA>
#> 2  a1b
#> 3  c4d
#> 4  e9g
df %>% separate(x, c("A","B"), sep = "[0-9]")
#>      A    B
#> 1 <NA> <NA>
#> 2    a    b
#> 3    c    d
#> 4    e    g

df <- data.frame(x = c("x", "x y", "x y z", NA))
df
#>       x
#> 1     x
#> 2   x y
#> 3 x y z
#> 4  <NA>
df %>% separate(x, c("a", "b"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#>      a    b
#> 1    x <NA>
#> 2    x    y
#> 3    x    y
#> 4 <NA> <NA>

Created on 2022-08-03 by the reprex package (v2.0.1)

nathaneastwood · 2022-08-03T15:06:20Z

Thanks, this looks nice. I'm going away until the end of the week, starting from tonight. I'll try to look properly when I'm back.

nathaneastwood · 2022-08-14T16:34:40Z

I took a look at the extra = "merge" issue. I think we could use the following to split up the strings:

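    # One output column per name in `into`
    n_max <- length(into)
    # Locate every separator match in each string
    m <- gregexpr(sep, as.character(data[[col]]), perl = TRUE)
    if (n_max > 0) {
      # Keep only the first n_max - 1 matches so the remainder stays in the last piece
      m <- lapply(m, function(x) {
        i <- seq_along(x) < n_max
        structure(
          x[i],
          match.length = attr(x, "match.length")[i],
          index.type = attr(x, "index.type"),
          useBytes = attr(x, "useBytes")
        )
      })
    }
    # invert = TRUE returns the text between the matches, i.e. the pieces
    regmatches(as.character(data[[col]]), m, invert = TRUE)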

The problem is that this doesn't discard the "extra" pieces:

df <- data.frame(x = c("x", "x y", "x y z", NA))
#      a    b
# 1    x <NA>
# 2    x    y
# 3    x  y z
# 4 <NA> <NA>

Row 3 should be x y with a warning. This differs from the approach you took, which uses strsplit().
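
For reference, the capped-match idea above can be sketched as a self-contained base R function. The helper name `split_capped()` and the default separator `"[^[:alnum:]]+"` are assumptions made for illustration; neither comes from this PR. The sketch only shows that capping the matches keeps the remainder in the last piece, which is the behaviour reported for row 3 above.

```r
# Hypothetical helper (not part of poorman): split each string on at most
# `n_max - 1` matches of `sep`, so any remainder stays in the last piece.
split_capped <- function(x, n_max, sep = "[^[:alnum:]]+") {
  x <- as.character(x)
  m <- gregexpr(sep, x, perl = TRUE)
  m <- lapply(m, function(pos) {
    # Keep only the first n_max - 1 matches
    keep <- seq_along(pos) < n_max
    structure(
      pos[keep],
      match.length = attr(pos, "match.length")[keep],
      index.type = attr(pos, "index.type"),
      useBytes = attr(pos, "useBytes")
    )
  })
  # invert = TRUE returns the text between the kept matches
  regmatches(x, m, invert = TRUE)
}

pieces <- split_capped(c("x", "x y", "x y z"), n_max = 2)
pieces[[3]]
#> [1] "x"   "y z"
```

Getting "x y" for row 3 would presumably need an uncapped split followed by truncation to `n_max` pieces, plus the warning.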

nathaneastwood · 2022-08-14T16:37:13Z

Here is an example of what fill is supposed to do (taken from the tidyr tests):

r$> df                                                                 
# A tibble: 2 × 1
  x    
  <chr>
1 a b  
2 a b c

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "left")            
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 NA    a     b    
2 a     b     c    

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "right")           
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     NA   
2 a     b     c

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "warn")            
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     NA   
2 a     b     c    
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
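
The padding shown in the tidyr output above can be sketched in base R as follows. This is only an illustration under assumptions: `pad_pieces()` is a hypothetical helper name, the strings are split on a literal space, and the "warn" case (pad on the right and emit a warning) is left out.

```r
# Hypothetical helper: pad a character vector of pieces to length `n` with NA,
# on the left or on the right.
pad_pieces <- function(pieces, n, fill = c("right", "left")) {
  fill <- match.arg(fill)
  na_pad <- rep(NA_character_, max(n - length(pieces), 0))
  if (fill == "left") c(na_pad, pieces) else c(pieces, na_pad)
}

# Same input as the tidyr example: "a b" has 2 pieces, "a b c" has 3
pieces <- strsplit(c("a b", "a b c"), " ", fixed = TRUE)

# One row per input string, one column per output name
left  <- do.call(rbind, lapply(pieces, pad_pieces, n = 3, fill = "left"))
right <- do.call(rbind, lapply(pieces, pad_pieces, n = 3, fill = "right"))
```

With `fill = "left"` the first row becomes NA, a, b; with `fill = "right"` it becomes a, b, NA, matching the tibbles above.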

etiennebacher · 2023-01-26T20:34:39Z

Passing thought: it might be worth implementing tidyr's new functions separate_wider_delim(), separate_wider_position(), and separate_wider_regex(). separate() would then only call one of these depending on the type of input.

nathaneastwood · 2023-01-26T23:11:30Z

I saw those. I may give them a miss. At some point I need to draw a cutoff, and dplyr and tidyr 1.0.0 make sense to me.

etiennebacher · 2023-01-27T12:15:28Z

I understand that you can't cover all new things in dplyr and tidyr. What I meant is just that even from the developer's point of view, it might be easier/cleaner to create these 3 functions separately and then call them in separate(). And then, since those functions will exist, it won't cost much to export them.
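
A rough sketch of that structure, under stated assumptions: the helper names `split_by_delim()`, `split_by_position()` and `split_column()` are hypothetical, they are not the tidyr functions, and only positive positions are handled. The point is only that the dispatch on the type of `sep` can stay very small once the splitting lives in separate functions.

```r
# Hypothetical internals: names and signatures are illustrative only.

# Split on a regular expression
split_by_delim <- function(x, sep) {
  strsplit(x, sep, perl = TRUE)
}

# Split after the given character positions (positive positions only)
split_by_position <- function(x, pos) {
  lapply(x, function(s) {
    starts <- c(1L, pos + 1L)
    ends <- c(pos, nchar(s))
    substring(s, starts, ends)
  })
}

# Dispatch on the type of `sep`, as separate() would
split_column <- function(x, sep) {
  x <- as.character(x)
  if (is.numeric(sep)) split_by_position(x, sep) else split_by_delim(x, sep)
}

by_regex <- split_column(c("a1b", "c4d"), "[0-9]")
by_pos <- split_column("abcd", 2)
```

Here `by_regex` holds the pieces a, b and c, d, and `by_pos` holds ab, cd.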

nathaneastwood · 2023-01-27T13:26:18Z

Ah I see what you mean. Yeah that seems like a good point, actually.

etiennebacher marked this pull request as draft August 3, 2022 12:29

etiennebacher added 4 commits August 14, 2022 16:17

feat: implement separate

111200f

typo

d51a029

work with numeric columns

947de37

convert to char if factor

fec063c

nathaneastwood force-pushed the feat_implement_separate branch from 1f2d19c to fec063c Compare August 14, 2022 15:17

nathaneastwood changed the title ~~Feat: implement separate()~~ feat: Implement separate() Aug 14, 2022

etiennebacher marked this pull request as ready for review October 26, 2022 07:34

etiennebacher marked this pull request as draft October 26, 2022 07:34

Merge branch 'nathaneastwood:master' into feat_implement_separate

416d7fd

etiennebacher mentioned this pull request May 25, 2023

datawizard versions of separate() and unite() easystats/datawizard#423

Closed
