Paper Reproducibility Changes #102
base: main
Conversation
This script compiles the data treatments performed by Cemal in his analysis notebook that generated many of the charts used in the paper. The purpose of this script is to facilitate reproducibility of the results in our paper: it takes in the raw set of trips as a csv, applies all data treatments, and saves the results to be loaded into the analysis notebooks. Note that I have not yet had a chance to verify that this works on the data from the TSDC, but it does yield the numbers we quote for participants and trips at the aggregate and program level when run on the raw file Cemal gave me.
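The treatment flow described above could be sketched roughly like this; the column names (`user_id`, `distance_m`) and the specific treatments shown are hypothetical placeholders, not the actual contents of the raw file:

```python
import pandas as pd

def apply_treatments(trips: pd.DataFrame) -> pd.DataFrame:
    """Apply the paper's data treatments to the raw trip table (illustrative)."""
    cleaned = trips.dropna(subset=["user_id"])    # drop trips with no user
    cleaned = cleaned[cleaned["distance_m"] > 0]  # drop zero-length trips
    return cleaned.reset_index(drop=True)

# Toy raw input standing in for the real trips csv
raw = pd.DataFrame({
    "user_id": ["a", "a", None, "b"],
    "distance_m": [1200, 0, 500, 300],
})
treated = apply_treatments(raw)
treated.to_csv("treated_trips.csv", index=False)  # saved for the analysis notebooks
print(len(treated))  # 2 trips survive these example treatments
```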
This is ending up being VERY tricky and confusing. The goal is to have the results and charts we show in the paper be 100% reproducible from the TSDC data -- open source data and script to allow for full transparency and reproducibility. This will not only benefit the credibility of this paper, but will hopefully lay the groundwork to make analysis of other, future OpenPATH programs archived in the TSDC easy and accessible.

However, the data originally used to generate the paper is not the same as what the TSDC will be providing. The column names in the csvs are almost all different, and the TSDC is redacting a fair bit of information. The problematic columns being redacted (so far) include Age (used for some analysis of the effect of age on e-bike usage) and trip timestamps (a key datapoint in calculating when someone's first e-bike trip was, and then cleaning out all data before that point).

I've been spinning my wheels for a little more than a week now, trying to reverse engineer a way to get the data from the TSDC cleaned and filtered in such a way that it matches the process we outline in the paper and the numbers it yielded. I'm hoping it will help to write out my thought processes a little more; I've been doing that in a somewhat scattered manner, but should really include it here.

If I'm able to get the data to "match", the next hurdle is that a fair number of the charts rely on the data being loaded into the database, which I personally can't do because my computer can't process that much data at once without completely wigging out.
But even if I could (there are some parts of the dataset that I won't need for this highish-level analysis, so I could clean them out and try loading a smaller subset of data), that's not the format that the TSDC is providing, so then I'll need to either A) reconcile the chart-generating tactics with different data sourcing, or B) wrestle the "matched" data into a zipped file compatible with the database loading. I'll keep updating here with my thoughts and progress as I continue to try different things.
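One of the treatments mentioned above -- finding each user's first e-bike trip and cleaning out all data before that point -- might look something like this sketch; the column names (`mode`, `start_ts`) and labels are assumptions for illustration:

```python
import pandas as pd

trips = pd.DataFrame({
    "user_id":  ["u1", "u1", "u1", "u2"],
    "mode":     ["car", "ebike", "car", "ebike"],
    "start_ts": [100, 200, 300, 50],
})

# Earliest e-bike trip timestamp per user
first_ebike = (trips[trips["mode"] == "ebike"]
               .groupby("user_id")["start_ts"].min()
               .reset_index(name="first_ebike_ts"))

# Keep only trips at or after each user's first e-bike trip;
# users with no e-bike trip at all are dropped by the inner merge
trips = trips.merge(first_ebike, on="user_id", how="inner")
trips = trips[trips["start_ts"] >= trips["first_ebike_ts"]]
print(len(trips))  # 3 trips remain in this toy example
```

This is exactly the step that breaks if the TSDC redacts trip timestamps, since `start_ts` is the filtering key.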
One issue I've encountered is the lack of
I did not just use the functions in
I am fine with this for now, but we should revisit this whole mapping when we re-do the energy/emissions work. In general, we should implement scaffolding as an abstract class with two concrete subclasses. We already use something similar for the abstract timeseries, and since we use the base image anyway, we can also switch to using that standard approach and add a new implementation instead of having 5 different implementations for data access. Abstractions FTW!
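The suggested shape -- one abstract access layer with concrete subclasses per backing store, in the spirit of the abstract timeseries -- could look like this; all class and method names here are made up for illustration:

```python
from abc import ABC, abstractmethod

class TripDataSource(ABC):
    """Abstract data access layer; callers never see where the data lives."""

    @abstractmethod
    def load_trips(self) -> list:
        """Return all trips as a list of records."""

class CsvTripSource(TripDataSource):
    """Concrete subclass backed by rows parsed from a csv."""
    def __init__(self, rows):
        self.rows = rows

    def load_trips(self):
        return list(self.rows)

class DatabaseTripSource(TripDataSource):
    """Concrete subclass backed by a database handle (hypothetical API)."""
    def __init__(self, db):
        self.db = db

    def load_trips(self):
        return self.db.query_all()  # placeholder for a real query

# Analysis code depends only on the abstraction
source: TripDataSource = CsvTripSource([{"trip_id": 1}])
print(source.load_trips())
```

The payoff is that the notebooks call `load_trips()` the same way whether the data comes from the TSDC csvs or the database, instead of carrying several parallel access paths.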
Another issue is that the numbers are not matching up. When I read in the TSDC data, I read it in program by program, matching the confirmed trips with the sociodemographic data, then concatenating the merged data into the dataframe so that I end up with all of the merged data together. But the number of users in this dataset is less than I would expect. As I'm accumulating the data, the programs have 13, 47, 29, 14, 14, and 9 users, for a total of 126 users. That seems a little low for having merged only sociodemographic data and done no other cleaning, but it is still > 122, so OK. But after all the data is put together there are only 112 unique IDs. This is a problem. I'm worried that some of the users in different programs ended up with the same random ID as a result of the data cleaning process. I'm going to work on a way to prevent this -- maybe appending the program name to the beginning of the id before I add that program to all the other programs.
Yep, I think this is exactly what happened: after adding the program name to the id, the user count after compiling the programs went from 112 to the accurate 126. One step closer to finding the right data cleaning process! BUT after just part of the filtering we're down to 118 users ... maybe the socio merging is off just a little somehow?
Wait, OK, so the bigger concern is the dip between the number of users that have trips and the number of users that have trips and a survey -- you can't enter the app without filling out a survey; even if you say "wish not to say" for all of the responses, we still have a survey record for you. @shankari is this correct? If so, then we might have a bigger problem, because there are some places where, just reading in the csvs, there are fewer entries in the surveys csv than there are unique users in the trip dataset. I don't feel like that should be possible, should it? This is saying there were 12 unique users in vail and 11 entries in the survey list, down to 9 after deduplication, so the number of vail users got dropped from 12 to 9 because 3 did not have a survey entry.
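A quick way to surface exactly which trip users lack a survey record is a left merge with pandas' `indicator` flag; the IDs below are placeholders:

```python
import pandas as pd

trips = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u3"]})
surveys = pd.DataFrame({"user_id": ["u1", "u3", "u3"]})  # u3 duplicated

# Deduplicate the survey table, then anti-join: trip users with no survey
surveys = surveys.drop_duplicates("user_id")
merged = trips.drop_duplicates("user_id").merge(
    surveys, on="user_id", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "user_id"]
print(list(missing))  # ["u2"] -- has trips but no survey entry
```

Listing the offending IDs per program (rather than just counting the dip) makes it possible to check whether those users genuinely never filed a survey or whether their records are being lost in the merge.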
Keeping git up to date as I update the paper; the changes are messy but need to be kept.
Loading from the database / showing participation rates; working on centralizing to as few scripts as possible.
Yesterday I discovered that (at least part of) the problem was leading/trailing whitespace on some of the userids; fixing that got rid of the problem where I was randomly dropping users in the trip-survey merging process. I'm still working with the TSDC data and my data cleaning scripts to verify that, when starting with TSDC data, the process of preparing the data is equivalent to the one we used in the paper.
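The whitespace failure mode is easy to reproduce: an exact-match merge silently drops any user whose id carries stray spaces, and stripping before the merge restores them. A minimal sketch with made-up IDs:

```python
import pandas as pd

trips = pd.DataFrame({"user_id": ["u1 ", " u2"]})  # stray whitespace in ids
surveys = pd.DataFrame({"user_id": ["u1", "u2"]})

# Without stripping, "u1 " != "u1", so the merge silently drops both users
naive = trips.merge(surveys, on="user_id")

trips["user_id"] = trips["user_id"].str.strip()
fixed = trips.merge(surveys, on="user_id")
print(len(naive), len(fixed))  # 0 2
```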
Mine is not currently working for some reason; it is erroring out over "within" in one part of the code but not others.
Very close to the TSDC numbers; all but 8 of the charts are now in this Analysis file.
New status update heading into the holidays: my TSDC data work file intakes the files that the TSDC will have, and outputs similar numbers (same number of users, off by around 1,000 trips). The TSDC data does still have some issues that I can see:
My Analysis script has almost all of the charts generated now:
Left to address:
working through the energy calculations, might still need some updates to the data used as input until the paper is matched exactly
figured out what the underlying function could have been - was able to use the seaborn library
Discovered some of the header code was no longer needed since we are working with csvs and not mongodump; factored repeated code into a function.
these have been replaced by the notebooks in the Abby folder, and are no longer needed as the newer notebooks are compatible with the TSDC data
implemented function, removed unnecessary code, centralized import statements
needed for spatial analysis notebook functions
after refactoring, the outputs now reflect the refined notebooks
This reverts commit 05398e2.
Large chunk of refactoring now done on this branch. Based on commentary from @iantei on #118, this PR is now much smaller, with the elimination of older code and a significant reduction in duplicate code. Changes fall in three folders:
replaced by work in e-mission#102
work now in e-mission#102
these are the scripts that I got from Cemal, and what later work on the paper visualizations in e-mission#102 is based on
this way expected outputs can be viewed, separate from the code itself
As I am going through the charts in the paper to polish them up, I am also taking the time to organize, document, and check in the code used to produce those results. This maintains transparency for future researchers who might want to reproduce our results.
Added a DataFiltering Notebook:
Analysis Notebooks: planning for one with non-spatial data and one to work with spatial data; will update as I make these changes, as the plan may change depending on the data formats.
Another note: much of this code is coming from a previous researcher who worked on the paper, Cemal Akcicek. My work is focused on organizing and polishing.