
R Reporting - API Scripts - functions to create loops for batched import of records #1

Open
sarahollis opened this issue Jul 27, 2021 · 1 comment

@sarahollis
Collaborator

To avoid API timeouts, the scripts need to be adapted to loop through records in pre-defined chunk sizes.
A sketch of the rationale:

library(dplyr) # for bind_rows()

chunk_size <- 10000
base_url <- "url" # placeholder base URL
chunk_num <- 0
# pull_data_func() stands in for the actual GET call; pull the first chunk
data_instance <- pull_data_func(paste0(base_url, "?$skip=0&$limit=", chunk_size))
data_out <- data_instance

# keep requesting chunks until the API returns fewer rows than chunk_size
while (nrow(data_instance) == chunk_size) {
  chunk_num <- chunk_num + chunk_size
  temp_url <- paste0(base_url, "?$skip=", chunk_num, "&$limit=", chunk_size)
  data_instance <- pull_data_func(temp_url)
  data_out <- bind_rows(data_out, data_instance)
}
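
Note that this sketch uses a short final chunk as its stop signal, so when the total record count is an exact multiple of chunk_size the loop makes one extra, empty request before exiting. Querying the record count up front (as in the comment below) avoids that.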

@jamesfuller-cdc
Collaborator

jamesfuller-cdc commented Aug 4, 2021

In the James branch, published 8/4/2021, the approach is:

  1. get the total number of records using the 'count' API endpoint
  2. create an empty data frame
  3. define your 'batch size'
  4. using a 'while' loop, iterate through and download the records in batches, appending each batch to the existing data frame; the GET call must incorporate both limit and skip filters (see the code below)

Remaining issues:
This still times out with extremely large datasets. The test instance has over 1 million follow-up records, and this iterative process still times out after 800 or 900 thousand records. The resulting error code is 524, and I've reached out to Clarisoft via JIRA to get some help.
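
One possible mitigation while waiting on Clarisoft (a sketch only, not what the branch currently does): wrap each batch request in httr::RETRY() with an explicit request timeout, so a single 524 retries the current batch instead of aborting the whole import. The retry settings below are illustrative; the URL, filter string, and variables match the loop further down.

# Sketch only: retry a failed batch instead of aborting the whole import.
# RETRY() and timeout() are standard httr; batch_size and skip are assumed
# to be defined as in the loop below.
cases.i <- RETRY("GET",
                 paste0(url, "api/outbreaks/", outbreak_id, "/cases",
                        "/?filter={%22limit%22:", format(batch_size, scientific = FALSE),
                        ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                 add_headers(Authorization = paste("Bearer", get_access_token(), sep = " ")),
                 timeout(600),       # give up on a hung request after 10 minutes
                 times = 5,          # retry each batch up to 5 times
                 pause_base = 2) %>% # exponential back-off between attempts
  content(as = "text") %>%
  fromJSON(flatten = TRUE) %>%
  as_tibble()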

###################################################################################################
# GET CASES
###################################################################################################

# assumes httr, jsonlite, and dplyr are loaded, and that url, outbreak_id,
# and get_access_token() are defined earlier in the script

# get the total number of cases from the 'count' endpoint
cases_n <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/cases/count"),
               add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
  content(as = "text") %>% fromJSON(flatten = TRUE) %>% unlist() %>% unname()

# import cases in batches
cases <- tibble()
batch_size <- 50000 # number of records to import per iteration
skip <- 0
while (skip < cases_n) {
  message("********************************")
  message(paste0("Importing records ", format(skip + 1, scientific = FALSE), " to ", format(skip + batch_size, scientific = FALSE)))
  cases.i <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/cases",
                        "/?filter={%22limit%22:", format(batch_size, scientific = FALSE), ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                 add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
    content(as = "text") %>%
    fromJSON(flatten = TRUE) %>%
    as_tibble()
  message(paste0("Imported ", format(nrow(cases.i), scientific = FALSE), " records"))
  cases <- cases %>% bind_rows(cases.i)
  skip <- skip + batch_size
  message(paste0("Data frame now has ", format(nrow(cases), scientific = FALSE), " records"))
  rm(cases.i)
}
rm(batch_size, skip, cases_n)
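
Since contacts and follow-ups will presumably need the same treatment, one possible refactor (a hypothetical sketch, not in the James branch; it assumes the same url, outbreak_id, and get_access_token() setup) is to factor the count-then-batch loop into a helper that takes the collection name:

# Hypothetical helper: batch-import any outbreak collection ("cases",
# "contacts", ...) using the same count + skip/limit loop as above.
get_batched <- function(collection, batch_size = 50000) {
  n <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/", collection, "/count"),
           add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
    content(as = "text") %>% fromJSON(flatten = TRUE) %>% unlist() %>% unname()
  out <- tibble()
  skip <- 0
  while (skip < n) {
    batch <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/", collection,
                        "/?filter={%22limit%22:", format(batch_size, scientific = FALSE),
                        ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                 add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
      content(as = "text") %>% fromJSON(flatten = TRUE) %>% as_tibble()
    out <- bind_rows(out, batch)
    skip <- skip + batch_size
  }
  out
}

# e.g. cases <- get_batched("cases")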
