Expected behaviour for grouped data + sample_n_keys #104

Open
emitanaka opened this issue Nov 1, 2020 · 5 comments
Labels
bug Something isn't working

Comments

@emitanaka

I was hoping to sample 1 key per group, as below, but the output seems somewhat random: some groups get the expected 1 key, while others get 2 keys instead of 1, and so on.

library(tsibble)
library(brolgar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(1)
out <- ChickWeight %>% 
  as_tsibble(key = Chick, index = Time) %>% 
  group_by(Diet) %>% # 4 diets
  sample_n_keys(1) # expecting 1 chick per diet

# shows as 50 chicks - is it a tsibble thing?
out
#> # A tsibble: 48 x 4 [1]
#> # Key:       Chick [50]
#> # Groups:    Diet [4]
#>    weight  Time Chick Diet 
#>     <dbl> <dbl> <ord> <fct>
#>  1     41     0 13    1    
#>  2     48     2 13    1    
#>  3     53     4 13    1    
#>  4     60     6 13    1    
#>  5     65     8 13    1    
#>  6     67    10 13    1    
#>  7     71    12 13    1    
#>  8     70    14 13    1    
#>  9     71    16 13    1    
#> 10     81    18 13    1    
#> # … with 38 more rows

# actual number of chicks sampled
# the number sampled seems random. Sometimes it is correct, sometimes not, as below.
out %>% 
  distinct(Chick, Diet)
#> # A tibble: 7 x 2
#> # Groups:   Diet [4]
#>   Chick Diet 
#>   <ord> <fct>
#> 1 13    1    
#> 2 30    2    
#> 3 22    2    
#> 4 37    3    
#> 5 36    3    
#> 6 45    4    
#> 7 43    4

Created on 2020-11-01 by the reprex package (v0.3.0.9001)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.1 (2020-06-06)
#>  os       macOS Catalina 10.15.7      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2020-11-01                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package        * version    date       lib source                            
#>  anytime          0.3.9      2020-08-27 [1] CRAN (R 4.0.2)                    
#>  assertthat       0.2.1      2019-03-21 [2] CRAN (R 4.0.0)                    
#>  backports        1.1.10     2020-09-15 [1] CRAN (R 4.0.2)                    
#>  brolgar        * 0.0.6.9100 2020-10-30 [1] Github (njtierney/brolgar@28e95bb)
#>  cli              2.1.0      2020-10-12 [1] CRAN (R 4.0.2)                    
#>  colorspace       1.4-1      2019-03-18 [1] CRAN (R 4.0.2)                    
#>  crayon           1.3.4      2017-09-16 [2] CRAN (R 4.0.0)                    
#>  digest           0.6.27     2020-10-24 [1] CRAN (R 4.0.2)                    
#>  distributional   0.2.1      2020-10-06 [1] CRAN (R 4.0.2)                    
#>  dplyr          * 1.0.1      2020-07-26 [1] Github (tidyverse/dplyr@16647fc)  
#>  ellipsis         0.3.1      2020-05-15 [2] CRAN (R 4.0.0)                    
#>  evaluate         0.14       2019-05-28 [2] CRAN (R 4.0.0)                    
#>  fabletools       0.2.1      2020-09-03 [1] CRAN (R 4.0.2)                    
#>  fansi            0.4.1      2020-01-08 [2] CRAN (R 4.0.0)                    
#>  farver           2.0.3.9000 2020-07-24 [1] Github (thomasp85/farver@f1bcb56) 
#>  fs               1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                    
#>  generics         0.0.2      2018-11-29 [2] CRAN (R 4.0.0)                    
#>  ggplot2          3.3.2      2020-06-19 [1] CRAN (R 4.0.2)                    
#>  glue             1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                    
#>  gtable           0.3.0      2019-03-25 [2] CRAN (R 4.0.0)                    
#>  highr            0.8        2019-03-20 [2] CRAN (R 4.0.0)                    
#>  htmltools        0.5.0      2020-06-16 [1] CRAN (R 4.0.2)                    
#>  knitr            1.29       2020-06-23 [1] CRAN (R 4.0.2)                    
#>  lifecycle        0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                    
#>  lubridate        1.7.9      2020-06-08 [2] CRAN (R 4.0.1)                    
#>  magrittr         1.5        2014-11-22 [2] CRAN (R 4.0.0)                    
#>  munsell          0.5.0      2018-06-12 [2] CRAN (R 4.0.0)                    
#>  pillar           1.4.6      2020-07-10 [1] CRAN (R 4.0.1)                    
#>  pkgconfig        2.0.3      2019-09-22 [2] CRAN (R 4.0.0)                    
#>  purrr            0.3.4      2020-04-17 [2] CRAN (R 4.0.0)                    
#>  R6               2.5.0      2020-10-28 [1] CRAN (R 4.0.2)                    
#>  Rcpp             1.0.5      2020-07-06 [1] CRAN (R 4.0.0)                    
#>  reprex           0.3.0.9001 2020-08-08 [1] Github (tidyverse/reprex@9594ee9) 
#>  rlang            0.4.8      2020-10-08 [1] CRAN (R 4.0.2)                    
#>  rmarkdown        2.3        2020-06-18 [1] CRAN (R 4.0.2)                    
#>  rstudioapi       0.11       2020-02-07 [2] CRAN (R 4.0.0)                    
#>  scales           1.1.1      2020-05-11 [2] CRAN (R 4.0.0)                    
#>  sessioninfo      1.1.1      2018-11-05 [2] CRAN (R 4.0.0)                    
#>  stringi          1.4.6      2020-02-17 [2] CRAN (R 4.0.0)                    
#>  stringr          1.4.0      2019-02-10 [2] CRAN (R 4.0.0)                    
#>  styler           1.3.2      2020-02-23 [1] CRAN (R 4.0.1)                    
#>  tibble           3.0.4      2020-10-12 [1] CRAN (R 4.0.2)                    
#>  tidyr            1.1.2      2020-08-27 [1] CRAN (R 4.0.2)                    
#>  tidyselect       1.1.0      2020-05-11 [2] CRAN (R 4.0.0)                    
#>  tsibble        * 0.9.3.9000 2020-11-01 [1] Github (tidyverts/tsibble@e749eb6)
#>  utf8             1.1.4      2018-05-24 [2] CRAN (R 4.0.0)                    
#>  vctrs            0.3.2.9000 2020-07-26 [1] Github (r-lib/vctrs@df8a659)      
#>  withr            2.3.0      2020-09-22 [1] CRAN (R 4.0.2)                    
#>  xfun             0.16       2020-07-24 [1] CRAN (R 4.0.2)                    
#>  yaml             2.2.1      2020-02-01 [1] CRAN (R 4.0.2)                    
#> 
#> [1] /Users/etan0038/Library/R/4.0/library
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
@dicook
Collaborator

dicook commented Nov 1, 2020 via email

@dicook
Collaborator

dicook commented Nov 1, 2020

Oh, I think it is simply that brolgar assumes unique keys. Similarly, tsibble also assumes unique keys.

You need to use tibble/dplyr: group_by(your_character_variable) %>% sample_n()

Think of your variable as a grouping variable rather than an id variable. It would be interesting to think of handling this as an extension of tsibble and brolgar. It's a tsibble with replicates %^)

@dicook
Collaborator

dicook commented Nov 1, 2020

Oh, nope, it's something not working in sample_n_keys(), e.g.

wages %>%
  group_by(black) %>%
  sample_n_keys(2)

ignores the group_by
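
Until this is fixed, one possible workaround is to sample the keys per group with plain dplyr and then keep only the rows for those keys. This is only a sketch, not how brolgar implements sample_n_keys(): it assumes dplyr >= 1.0.0 (for slice_sample()), uses the ChickWeight data from the original report, and the names keys and sampled_keys are made up for illustration.

```r
library(dplyr)

set.seed(1)

# One row per (Chick, Diet) pair, i.e. one row per key within its group.
keys <- ChickWeight %>%
  as_tibble() %>%
  distinct(Chick, Diet)

# Sample 1 key within each Diet group.
sampled_keys <- keys %>%
  group_by(Diet) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Keep every observation that belongs to one of the sampled keys.
out <- ChickWeight %>%
  as_tibble() %>%
  semi_join(sampled_keys, by = c("Chick", "Diet"))

# Count distinct chicks per diet; each count should be 1.
out %>%
  group_by(Diet) %>%
  summarise(n_chicks = n_distinct(Chick))
```

The result is a plain tibble; if a tsibble is needed afterwards, it could presumably be converted with as_tsibble(key = Chick, index = Time) as in the original report.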

@njtierney
Owner

Thanks for posting the issue, @emitanaka !

This seems like a bug, I'll fix this before submitting to CRAN.

@njtierney added the bug label Nov 1, 2020
@emitanaka
Author

@dicook actually the behaviour is random. If you repeat your command, occasionally it shows some rows. The bug could be related to the fact that it still thinks the number of keys is the same as in the full data (not sure).
