
R Reporting - API Scripts - functions to create loops for batched import of records #1

Open
sarahollis opened this issue Jul 27, 2021 · 1 comment

@sarahollis
Collaborator

To avoid API timeouts, the scripts need to be adapted to loop through records in pre-defined chunk sizes.
A sketch of the rationale:

library(dplyr) # for bind_rows()

chunk_size <- 10000
base_url <- "url" # placeholder base URL
chunk_num <- 0
# pull_data_func() stands in for the actual GET call; pull the first chunk
data_instance <- pull_data_func(paste0(base_url, "?$skip=0&$limit=", chunk_size))
data_out <- data_instance

# keep requesting chunks until the API returns fewer rows than chunk_size
while (nrow(data_instance) == chunk_size) {
  chunk_num <- chunk_num + chunk_size
  temp_url <- paste0(base_url, "?$skip=", chunk_num, "&$limit=", chunk_size)
  data_instance <- pull_data_func(temp_url)
  data_out <- bind_rows(data_out, data_instance)
}
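
Note that this sketch uses a short final chunk as its stop signal, so when the total record count is an exact multiple of chunk_size the loop makes one extra, empty request before exiting. Querying the record count up front (as in the comment below) avoids that.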

@jamesfuller-cdc
Collaborator

jamesfuller-cdc commented Aug 4, 2021

In the James branch, published 8/4/2021, the approach is:

  1. get the total number of records using the 'count' API endpoint
  2. create an empty data frame
  3. define your 'batch size'
  4. using a 'while' loop, iterate through and download the records in batches, appending each batch to the existing data frame; the GET call must incorporate both limit and skip filters (see the code below)

Remaining issues:
This still times out with extremely large datasets. The test instance has over 1 million follow-up records, and this iterative process still times out after 800 or 900 thousand records. The resulting error code is 524, and I've reached out to Clarisoft via JIRA to get some help.
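
One possible mitigation while waiting on Clarisoft (a sketch only, not what the branch currently does): wrap each batch request in httr::RETRY() with an explicit request timeout, so a single 524 retries the current batch instead of aborting the whole import. The retry settings below are illustrative; the URL, filter string, and variables match the loop further down.

# Sketch only: retry a failed batch instead of aborting the whole import.
# RETRY() and timeout() are standard httr; batch_size and skip are assumed
# to be defined as in the loop below.
cases.i <- RETRY("GET",
                 paste0(url, "api/outbreaks/", outbreak_id, "/cases",
                        "/?filter={%22limit%22:", format(batch_size, scientific = FALSE),
                        ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                 add_headers(Authorization = paste("Bearer", get_access_token(), sep = " ")),
                 timeout(600),       # give up on a hung request after 10 minutes
                 times = 5,          # retry each batch up to 5 times
                 pause_base = 2) %>% # exponential back-off between attempts
  content(as = "text") %>%
  fromJSON(flatten = TRUE) %>%
  as_tibble()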

###################################################################################################
# GET CASES
###################################################################################################

# assumes httr, jsonlite, and dplyr are loaded, and that url, outbreak_id,
# and get_access_token() are defined earlier in the script

# get the total number of cases from the 'count' endpoint
cases_n <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/cases/count"),
               add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
  content(as = "text") %>% fromJSON(flatten = TRUE) %>% unlist() %>% unname()

# import cases in batches
cases <- tibble()
batch_size <- 50000 # number of records to import per iteration
skip <- 0
while (skip < cases_n) {
  message("********************************")
  message(paste0("Importing records ", format(skip + 1, scientific = FALSE), " to ", format(skip + batch_size, scientific = FALSE)))
  cases.i <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/cases",
                        "/?filter={%22limit%22:", format(batch_size, scientific = FALSE), ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                 add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
    content(as = "text") %>%
    fromJSON(flatten = TRUE) %>%
    as_tibble()
  message(paste0("Imported ", format(nrow(cases.i), scientific = FALSE), " records"))
  cases <- cases %>% bind_rows(cases.i)
  skip <- skip + batch_size
  message(paste0("Data frame now has ", format(nrow(cases), scientific = FALSE), " records"))
  rm(cases.i)
}
rm(batch_size, skip, cases_n)
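
Since contacts and follow-ups will presumably need the same treatment, one possible refactor (a hypothetical sketch, not in the James branch; it assumes the same url, outbreak_id, and get_access_token() setup) is to factor the count-then-batch loop into a helper that takes the collection name:

# Hypothetical helper: batch-import any outbreak collection ("cases",
# "contacts", ...) using the same count + skip/limit loop as above.
get_batched <- function(collection, batch_size = 50000) {
  n <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/", collection, "/count"),
           add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
    content(as = "text") %>% fromJSON(flatten = TRUE) %>% unlist() %>% unname()
  out <- tibble()
  skip <- 0
  while (skip < n) {
    batch <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/", collection,
                        "/?filter={%22limit%22:", format(batch_size, scientific = FALSE),
                        ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                 add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
      content(as = "text") %>% fromJSON(flatten = TRUE) %>% as_tibble()
    out <- bind_rows(out, batch)
    skip <- skip + batch_size
  }
  out
}

# e.g. cases <- get_batched("cases")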
