To help you follow the data, the following section shows the overall data flow. This gives us an indication of the different I/O steps involved.
Main Route:
XLSForm > Kobo Dashboard > Raw Survey Data XLS > R Scripts
Sub Route 1:
R Scripts > Bivariate Statistics Table > PG Database for API use > Crosstabs Visualization API
Sub Route 2:
R Scripts > Univariate Statistics Table > PG Database for API use > Univariate analysis visualization API (similar to ODP)
- We need to carefully design the input and output tables/JSONs/CSVs at different stages.
- Different people are involved at different stages. We need to keep our variable names and data formats in sync across those stages.
- Ideally, we want no manual steps. In practice, we should at least keep them to a minimum.
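One way to make the hand-off between stages less manual is to write down the columns each table is expected to have and check every exported file against that list. The snippet below is only a sketch of that idea, in Python for illustration (the pipeline itself uses R). The column names in `UNIVARIATE_COLUMNS` and the function `missing_columns` are hypothetical and not part of the project.

```python
import csv
import io

# Hypothetical contract for the univariate stats table.
# These column names are assumptions made for this example.
UNIVARIATE_COLUMNS = ["variable", "option", "count", "percent"]


def missing_columns(csv_text, expected):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [column for column in expected if column not in header]


# Made-up sample: the "percent" column was left out of the export.
sample = "variable,option,count\np_econ_hhd_items_pre_covid,yes,3\n"
print(missing_columns(sample, UNIVARIATE_COLUMNS))
```

A check like this can run right after each table is written, so a renamed or dropped column is caught before it reaches the database.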
This section lists out all the things I did for data processing. This should aid us in future projects as well.
- Rename variables: You should pay careful attention to the names you assign to different questions and option variables throughout your XLSForm (read about XLSForm here).
- Tip 1: Have a system for naming that helps you easily call variables in R. When designing your own conventions, think about whether that convention is easy for others to understand as well.
  As an example, in our workers' survey, we've used prefixes (`i_` for impact, `p_` for preparedness, `m_` for metadata, `o_` for outlook, `b_` for baseline). We have also grouped similar columns (e.g., `_econ` for questions around the economic effects of Covid19, `_lvlhd_` for effects on workers' livelihoods). Finally, we've tried using shorter names (like `shrt_names`). However, when naming variables, we've prioritized readability and specificity over name length (good name: `p_econ_hhd_items_pre_covid`; bad name: `p_e_h_pc`).
- Tip 2: Don't fret too much. This is an iterative process. Note, however, that once you go live, you should not change variable names. You've got until then to experiment and come up with a usable, legible naming convention for your variables.
- Tip 3: In the XLSForm, the columns you need to work with are:
  - "name" column in the "survey" sheet
  - "list_name" column in the "choices" sheet
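The naming tips above can be checked mechanically before the form goes live. The following is a minimal sketch, in Python for illustration (the pipeline itself uses R). The prefixes come from Tip 1. The exact pattern, the "every segment is one or two characters" readability rule, and the function name `check_name` are assumptions made for this example, not project rules.

```python
import re

# Prefixes from the convention described in Tip 1.
PREFIXES = {
    "i": "impact",
    "p": "preparedness",
    "m": "metadata",
    "o": "outlook",
    "b": "baseline",
}

# Assumed rule: one-letter prefix, then lowercase snake_case segments.
# Empty segments are rejected, so a name can never contain "__".
NAME_PATTERN = re.compile(r"^(?P<prefix>[a-z])_[a-z0-9]+(?:_[a-z0-9]+)*$")


def check_name(name):
    """Return a list of problems found in a variable name (empty if none)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return ["not lowercase snake_case with a one-letter prefix"]
    problems = []
    if match.group("prefix") not in PREFIXES:
        problems.append("unknown prefix")
    segments = name.split("_")[1:]
    # Hypothetical readability rule: flag names built only from 1-2 letter parts.
    if all(len(segment) <= 2 for segment in segments):
        problems.append("segments too short to be readable")
    return problems


for candidate in ["p_econ_hhd_items_pre_covid", "p_e_h_pc"]:
    print(candidate, check_name(candidate))
```

Running a check like this over the "name" column of the "survey" sheet would flag names such as `p_e_h_pc` while the convention can still be changed.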
- Use appropriate download settings; here's what we use. Note that the "Group Separator", though disabled, is set to "__" (two underscores). This is important and must be followed.
- Not much else. Bhawak uploads the file and deploys the survey here.
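If group names ever do appear in exported column headers, a fixed two-underscore separator makes them easy to split apart in a script. This is a small sketch of that, in Python for illustration (the pipeline itself uses R). It assumes headers of the form `group__variable`; the function `split_column` and the group name `grp_preparedness` are hypothetical.

```python
GROUP_SEPARATOR = "__"  # two underscores, matching the download setting


def split_column(column):
    """Split an exported column header into (list of groups, variable name)."""
    *groups, variable = column.split(GROUP_SEPARATOR)
    return groups, variable


print(split_column("grp_preparedness__p_econ_hhd_items_pre_covid"))
print(split_column("m_age"))
```

This only works if variable names themselves never contain two consecutive underscores, which is one more reason to keep the naming convention strict.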
Nothing. The idea is to not touch the data file at all. Everything must be done either before this stage or after this stage.
- ✔️ Generate tables for API use:
  a. univariate stats table
  b. bivariate stats table
  c. ...other?
  d. ❌ Is it exhaustive?
- ✔️ Finalize variable names for API use (i.e., finalize the data format contract). The `keys`, `values`, and `labels`, if set here, can simplify integration with Django.
- Isolate single-select and multi-select variables: You have to go through the questionnaire and do this question by question.
- ❓ Think of what to do for branched variables
- ❓ Map variable names to respective labels in English and Nepali.
- More:
- write reusable, general functions
- simplify code
- use comments
- name properly
- etc.
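To make the univariate and bivariate tables in the checklist above concrete, here is a minimal sketch of how the counts could be computed, in Python for illustration (the actual scripts are in R). It assumes multi-select answers arrive as space-separated option names in a single cell, which is how XLSForm `select_multiple` values are commonly exported. The function names, variable names, and sample rows are all made up for this example.

```python
from collections import Counter


def univariate(rows, variable, multiselect=False):
    """Count responses per option for one variable, skipping blanks."""
    counts = Counter()
    for row in rows:
        value = row.get(variable)
        if not value:
            continue
        # Assumption: multi-select cells hold space-separated option names.
        options = value.split() if multiselect else [value]
        counts.update(options)
    return dict(counts)


def bivariate(rows, var_a, var_b):
    """Count responses per pair of options for two single-select variables."""
    counts = Counter()
    for row in rows:
        a, b = row.get(var_a), row.get(var_b)
        if a and b:
            counts[(a, b)] += 1
    return dict(counts)


# Made-up rows with hypothetical variable names.
rows = [
    {"m_gender": "female", "i_lvlhd_job_loss": "yes", "p_econ_hhd_items": "rice oil"},
    {"m_gender": "male", "i_lvlhd_job_loss": "no", "p_econ_hhd_items": "rice"},
    {"m_gender": "female", "i_lvlhd_job_loss": "yes", "p_econ_hhd_items": ""},
]

print(univariate(rows, "m_gender"))
print(univariate(rows, "p_econ_hhd_items", multiselect=True))
print(bivariate(rows, "m_gender", "i_lvlhd_job_loss"))
```

The `multiselect` flag is why single-select and multi-select variables have to be isolated first: the same cell is counted differently depending on the question type.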
- XLSForm can be downloaded from the Kobo dashboard.
- Raw Survey Data XLS can also be downloaded from the Kobo dashboard. Be sure to use the download settings shown above.
- Workers survey form: https://ee.humanitarianresponse.info/x/RGuatGl6