Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some suggestions #22

Open
ramnathv opened this issue May 12, 2019 · 3 comments
Open

Some suggestions #22

ramnathv opened this issue May 12, 2019 · 3 comments
Assignees

Comments

@ramnathv
Copy link

@robinsones First off, great package! It makes creating and analyzing funnels clean and easy. Based on my navigation of the package's API, I had some suggestions:

You might want to make landed and registered datasets in the package so users can get started with the examples without having to define them. This can be extended to define out-of-memory versions of the data by taking advantage of dbplyr::src_dbi.

The notion of a join type is great and factors in the multiple scenarios that one might run into. However, I had trouble visualizing the execution and how each type results in a different output compared to the pure version of the join function. So I put together a helper function that allows one to visualize the differences. Based on this, here is my understanding of the working of after_join:

  1. Join the tables using the regular version of {mode}_join
  2. Filter out all records where event_x occurs before event_y
  3. For each user_id only retain versions of events specified by type
    • any-any will retain all records in Step (2)
    • first-first, first-firstafter, and lastbefore-firstafter will retain only ONE record per user.

Is my understanding correct? It might be useful to add something like this to the documentation so it is really clear to users how these after_joins work.

after_join_all
after_join_all <- function(x, y, by, mode = 'inner', ...){
  types <- c(
    'first-first', 'first-firstafter', 'lastbefore-firstafter',
    'any-firstafter', 'any-any'
  )
  by_type <- function(type){
    after_join(x, y, ..., mode = mode, type = type) %>% 
      mutate(!!type := 'Y') 
  }
  join_fun <- match.fun(paste0(mode, '_join'))
  all_types <- types %>% 
    purrr::map(by_type)
  join_fun(x, y, by = by) %>% 
    Reduce(left_join, all_types, init = .)
}
user_id timestamp.x timestamp.y first-first first-firstafter lastbefore-firstafter any-firstafter any-any
1 2018-07-01 2018-07-02
3 2018-07-02 2018-07-02
4 2018-07-01 2018-06-10 NA NA NA NA NA
4 2018-07-01 2018-07-02 NA
4 2018-07-04 2018-06-10 NA NA NA NA NA
4 2018-07-04 2018-07-02 NA NA NA NA NA
5 2018-07-10 2018-07-11
5 2018-07-12 2018-07-11 NA NA NA NA NA
6 2018-07-07 2018-07-10 NA
6 2018-07-07 2018-07-11 NA NA NA NA
6 2018-07-08 2018-07-10 NA NA
6 2018-07-08 2018-07-11 NA NA NA NA
@robinsones
Copy link
Contributor

Thanks Ramnath! I'll work on these this week.

@robinsones robinsones self-assigned this May 23, 2019
@robinsones
Copy link
Contributor

robinsones commented May 23, 2019

@ramnathv from my understanding of this function, you'd want to specify by_user and by_time in it, and the by argument would be the same as the by_user. Is that right? If so, it seems simpler to replace by with by_user and by_time

How I reproduced your result:

landed <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-07-01",
  2, "2018-07-01",
  3, "2018-07-02",
  4, "2018-07-01",
  4, "2018-07-04",
  5, "2018-07-10",
  5, "2018-07-12",
  6, "2018-07-07",
  6, "2018-07-08"
) %>%
  mutate(timestamp = as.Date(timestamp))

registered <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-07-02",
  3, "2018-07-02",
  4, "2018-06-10",
  4, "2018-07-02",
  5, "2018-07-11",
  6, "2018-07-10",
  6, "2018-07-11",
  7, "2018-07-07"
) %>%
  mutate(timestamp = as.Date(timestamp))

after_join_all(x = landed, 
               y = registered, 
               by = "user_id",
               by_user = "user_id", 
               by_time = "timestamp") 

@ramnathv
Copy link
Author

@robinsones That will work! Note that I am proposing this only as an internal utility function to help make clear the distinction of the output produced by the different variations, since it is a little nuanced. It will only be useful when working with toy datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants