1. Get the total number of records using the 'count' API endpoint.
2. Create an empty data frame.
3. Define your 'batch size'.
4. Using a 'while' loop, iterate through and download records in batches, then append each new batch to the existing data frame. The GET call must incorporate both the limit and skip filters (see the sketch after this list).
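For illustration, here is a minimal sketch of how the limit/skip filter could be assembled and URL-encoded before the GET call. The `build_case_url()` helper is hypothetical and not part of the existing scripts; `url`, `outbreak_id` and the token handling are assumed to match the code further down.

library(httr)
library(jsonlite)

# Hypothetical helper: build the /cases URL with a JSON filter {"limit": ..., "skip": ...}
# and URL-encode it, rather than hand-escaping the quotes with %22.
build_case_url <- function(base_url, outbreak_id, limit, skip) {
  filter <- toJSON(list(limit = limit, skip = skip), auto_unbox = TRUE)
  paste0(base_url, "api/outbreaks/", outbreak_id, "/cases?filter=",
         URLencode(as.character(filter), reserved = TRUE))
}

# e.g. the first batch of 50,000 records:
# GET(build_case_url(url, outbreak_id, limit = 50000, skip = 0),
#     add_headers(Authorization = paste("Bearer", get_access_token())))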
Remaining issues:
This still times out on extremely large datasets. The test instance has over 1 million follow-up records, and the iterative process still times out after roughly 800,000 to 900,000 records with an HTTP 524 error. I've reached out to Clarisoft via JIRA for help.
###################################################################################################
# GET CASES
###################################################################################################
library(httr)
library(jsonlite)
library(dplyr)
library(tibble)

# url, outbreak_id and get_access_token() are defined elsewhere in the scripts

# Get the total number of cases via the 'count' endpoint
cases_n <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/cases/count"),
               add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
  content(as = "text") %>%
  fromJSON(flatten = TRUE) %>%
  unlist() %>%
  unname()

# Import cases in batches
cases <- tibble()
batch_size <- 50000 # number of records to import per iteration
skip <- 0

while (skip < cases_n) {
  message("********************************")
  message(paste0("Importing records ", format(skip + 1, scientific = FALSE),
                 " to ", format(skip + batch_size, scientific = FALSE)))
  cases.i <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/cases",
                        "/?filter={%22limit%22:", format(batch_size, scientific = FALSE),
                        ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                 add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
    content(as = "text") %>%
    fromJSON(flatten = TRUE) %>%
    as_tibble()
  message(paste0("Imported ", format(nrow(cases.i), scientific = FALSE), " records"))
  cases <- cases %>% bind_rows(cases.i)
  skip <- skip + batch_size
  message(paste0("Data frame now has ", format(nrow(cases), scientific = FALSE), " records"))
  rm(cases.i)
}

rm(batch_size, skip, cases_n)
To avoid API timeouts, the scripts need to be adapted to loop through records in pre-defined chunk sizes.
Example of the rationale below:
chunk_size <- 10000
base_url <- "url"                  # placeholder endpoint
chunk_num <- 0
data_instance <- pull_data_func()  # pull the first chunk
dat_out <- data_instance

while (nrow(data_instance) == chunk_size) {
  chunk_num <- chunk_num + chunk_size
  temp_url <- paste0(base_url, "?$skip=", chunk_num, "&$limit=", chunk_size)
  data_instance <- pull_data_func(temp_url)
  dat_out <- bind_rows(dat_out, data_instance)
}
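Below is a runnable sketch of that pattern against the cases endpoint, assuming httr, jsonlite, dplyr and tibble are loaded and that url, outbreak_id and get_access_token() exist as in the script above; pull_chunk() is an illustrative helper, not an existing function.

# Illustrative sketch only: keep pulling until the API returns fewer rows than
# chunk_size, so the total record count never needs to be requested up front.
pull_chunk <- function(skip, chunk_size) {
  GET(paste0(url, "api/outbreaks/", outbreak_id, "/cases",
             "/?filter={%22limit%22:", format(chunk_size, scientific = FALSE),
             ",%22skip%22:", format(skip, scientific = FALSE), "}"),
      add_headers(Authorization = paste("Bearer", get_access_token()))) %>%
    content(as = "text") %>%
    fromJSON(flatten = TRUE) %>%
    as_tibble()
}

chunk_size <- 10000
skip <- 0
dat_out <- tibble()
repeat {
  chunk <- pull_chunk(skip, chunk_size)
  dat_out <- bind_rows(dat_out, chunk)
  skip <- skip + chunk_size
  if (nrow(chunk) < chunk_size) break  # last (partial or empty) chunk reached
}

Compared with the count-based loop above, this approach needs no call to the /count endpoint and simply stops once a chunk comes back with fewer rows than chunk_size.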