Paper Reproducibility Changes #102
base: main
Conversation
This script compiles the data treatments performed by Cemal in his analysis notebook that generated many of the charts used in the paper. The purpose of this script is to facilitate reproducibility of the results in our paper: it takes in the raw set of trips as a csv, applies all data treatments, and saves the results to be loaded into the analysis notebooks. Note that I have not yet had a chance to verify that this works on the data from the TSDC, but it does yield the numbers we quote for participants and trips at the aggregate and program level when run on the raw file Cemal gave me.
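The treatment flow described above could be sketched roughly like this; the column names (`user_id`, `distance_m`) and the specific treatments shown are hypothetical placeholders, not the actual contents of the raw file:

```python
import pandas as pd

def apply_treatments(trips: pd.DataFrame) -> pd.DataFrame:
    """Apply the paper's data treatments to the raw trip table (illustrative)."""
    cleaned = trips.dropna(subset=["user_id"])    # drop trips with no user
    cleaned = cleaned[cleaned["distance_m"] > 0]  # drop zero-length trips
    return cleaned.reset_index(drop=True)

# Toy raw input standing in for the real trips csv
raw = pd.DataFrame({
    "user_id": ["a", "a", None, "b"],
    "distance_m": [1200, 0, 500, 300],
})
treated = apply_treatments(raw)
treated.to_csv("treated_trips.csv", index=False)  # saved for the analysis notebooks
print(len(treated))  # 2 trips survive these example treatments
```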
This is ending up being VERY tricky and confusing. The goal is to have the results and charts we show in the paper be 100% reproducible from the TSDC data -- open source data and script to allow for full transparency and reproducibility. This will not only benefit the credibility of this paper, but will hopefully lay the groundwork to make analysis of other, future OpenPATH programs archived in the TSDC easy and accessible.

However, the data originally used to generate the paper is not the same as what the TSDC will be providing. The column names in the csvs are almost all different, and the TSDC is redacting a fair bit of information. The problematic columns being redacted (so far) include Age (used for some analysis of the effect of age on e-bike usage) and trip timestamps (a key datapoint in calculating when someone's first e-bike trip was, and then cleaning out all data before that point).

I've been spinning my wheels for a little more than a week now, trying to reverse engineer a way to get the data from the TSDC cleaned and filtered in such a way that it matches the process we outline in the paper and the numbers it yielded. I'm hoping it will help to write out my thought processes a little more; I've been doing that in a somewhat scattered manner, but should really include it here.

If I'm able to get the data to "match", the next hurdle is that a fair number of the charts rely on the data being loaded into the database, which I personally can't do because my computer can't process that much data at once without completely wigging out.
But even if I could (there are some parts of the dataset that I won't need for this highish-level analysis, so I could clean them out and try loading a smaller subset of data), that's not the format that the TSDC is providing, so then I'll need to either A) reconcile the chart-generating tactics with different data sourcing, or B) wrestle the "matched" data into a zipped file compatible with the database loading. I'll keep updating here with my thoughts and progress as I continue to try different things.
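One of the treatments mentioned above -- finding each user's first e-bike trip and cleaning out all data before that point -- might look something like this sketch; the column names (`mode`, `start_ts`) and labels are assumptions for illustration:

```python
import pandas as pd

trips = pd.DataFrame({
    "user_id":  ["u1", "u1", "u1", "u2"],
    "mode":     ["car", "ebike", "car", "ebike"],
    "start_ts": [100, 200, 300, 50],
})

# Earliest e-bike trip timestamp per user
first_ebike = (trips[trips["mode"] == "ebike"]
               .groupby("user_id")["start_ts"].min()
               .reset_index(name="first_ebike_ts"))

# Keep only trips at or after each user's first e-bike trip;
# users with no e-bike trip at all are dropped by the inner merge
trips = trips.merge(first_ebike, on="user_id", how="inner")
trips = trips[trips["start_ts"] >= trips["first_ebike_ts"]]
print(len(trips))  # 3 trips remain in this toy example
```

This is exactly the step that breaks if the TSDC redacts trip timestamps, since `start_ts` is the filtering key.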
One issue I've encountered is the lack of
I did not just use the functions in
I am fine with this for now, but we should revisit this whole mapping when we re-do the energy/emissions work. In general, we should implement scaffolding as an abstract class with two concrete subclasses. We already use something similar for the abstract timeseries, and since we use the base image anyway, we can also switch to using that standard approach and add a new implementation instead of having 5 different implementations for data access. Abstractions FTW!
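The suggested shape -- one abstract access layer with concrete subclasses per backing store, in the spirit of the abstract timeseries -- could look like this; all class and method names here are made up for illustration:

```python
from abc import ABC, abstractmethod

class TripDataSource(ABC):
    """Abstract data access layer; callers never see where the data lives."""

    @abstractmethod
    def load_trips(self) -> list:
        """Return all trips as a list of records."""

class CsvTripSource(TripDataSource):
    """Concrete subclass backed by rows parsed from a csv."""
    def __init__(self, rows):
        self.rows = rows

    def load_trips(self):
        return list(self.rows)

class DatabaseTripSource(TripDataSource):
    """Concrete subclass backed by a database handle (hypothetical API)."""
    def __init__(self, db):
        self.db = db

    def load_trips(self):
        return self.db.query_all()  # placeholder for a real query

# Analysis code depends only on the abstraction
source: TripDataSource = CsvTripSource([{"trip_id": 1}])
print(source.load_trips())
```

The payoff is that the notebooks call `load_trips()` the same way whether the data comes from the TSDC csvs or the database, instead of carrying several parallel access paths.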
Another issue is that the numbers are not matching up. When I read in the TSDC data, I read it in program by program, matching the confirmed trips with the sociodemographic data, then concatenating the merged data into the dataframe so that I end up with all of the merged data together. But the number of users in this dataset is less than I would expect. As I'm accumulating the data, the programs have 13, 47, 29, 14, 14, and 9 users, for a total of 126 users. That seems a little low for having merged only sociodemographic data and done no other cleaning, but it is still > 122, so OK. But after all the data is put together there are only 112 unique IDs. This is a problem. I'm worried that some of the users in different programs ended up with the same random ID as a result of the data cleaning process. I'm going to work on a way to prevent this -- maybe appending the program name to the beginning of the id before I add that program to all the other programs.
Yep, I think this is exactly what happened: after adding the program name to the id, the user count after compiling the programs went from 112 to the accurate 126. One step closer to finding the right data cleaning process! BUT after just part of the filtering we're down to 118 users ... maybe the socio merging is off just a little somehow?
Wait, OK, so the bigger concern is the dip between the number of users that have trips and the number of users that have trips and a survey -- you can't enter the app without filling out a survey; even if you say "wish not to say" for all of the responses, we still have a survey record for you. @shankari is this correct? If so, then we might have a bigger problem, because there are some places where, just reading in the csvs, there are fewer entries in the surveys csv than there are unique users in the trip dataset. I don't feel like that should be possible, should it? This is saying there were 12 unique users in vail and 11 entries in the survey list, down to 9 after deduplication, so the number of vail users got dropped from 12 to 9 because 3 did not have a survey entry.
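A quick way to surface exactly which trip users lack a survey record is a left merge with pandas' `indicator` flag; the IDs below are placeholders:

```python
import pandas as pd

trips = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u3"]})
surveys = pd.DataFrame({"user_id": ["u1", "u3", "u3"]})  # u3 duplicated

# Deduplicate the survey table, then anti-join: trip users with no survey
surveys = surveys.drop_duplicates("user_id")
merged = trips.drop_duplicates("user_id").merge(
    surveys, on="user_id", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "user_id"]
print(list(missing))  # ["u2"] -- has trips but no survey entry
```

Listing the offending IDs per program (rather than just counting the dip) makes it possible to check whether those users genuinely never filed a survey or whether their records are being lost in the merge.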
Keeping git up to date as I update the paper; the changes are messy but need to be kept.
Loading from the database / showing participation rates; working on centralizing to as few scripts as possible.
Yesterday I discovered that (at least part of) the problem was leading/trailing whitespace on some of the userids; fixing that got rid of the problem where I was randomly dropping users in the trip-survey merging process. I'm still working with the TSDC data and my data cleaning scripts to verify that, when starting with TSDC data, the process of preparing the data is equivalent to the one we used in the paper.
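The whitespace failure mode is easy to reproduce: an exact-match merge silently drops any user whose id carries stray spaces, and stripping before the merge restores them. A minimal sketch with made-up IDs:

```python
import pandas as pd

trips = pd.DataFrame({"user_id": ["u1 ", " u2"]})  # stray whitespace in ids
surveys = pd.DataFrame({"user_id": ["u1", "u2"]})

# Without stripping, "u1 " != "u1", so the merge silently drops both users
naive = trips.merge(surveys, on="user_id")

trips["user_id"] = trips["user_id"].str.strip()
fixed = trips.merge(surveys, on="user_id")
print(len(naive), len(fixed))  # 0 2
```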
Mine is not currently working for some reason; it is erroring out over "within" in one part of the code but not others.
Very close to the TSDC numbers; all but 8 of the charts are now in this Analysis file.
New status update heading into the holidays: my TSDC data work file intakes the files that the TSDC will have, and outputs similar numbers (same number of users, off by around 1,000 trips). The TSDC data does still have some issues that I can see:
My Analysis script has almost all of the charts generated now:
Left to address:
working through the energy calculations, might still need some updates to the data used as input until the paper is matched exactly
figured out what the underlying function could have been - was able to use the seaborn library
Discovered some of the header code was no longer needed since we are working with csvs and not mongodump; factored repeated code into a function.
these have been replaced by the notebooks in the Abby folder, and are no longer needed as the newer notebooks are compatible with the TSDC data
implemented function, removed unnecessary code, centralized import statements
needed for spatial analysis notebook functions
after refactoring, the outputs now reflect the refined notebooks
This reverts commit 05398e2.
Large chunk of refactoring now done on this branch. Based on commentary from @iantei on #118, this PR is now much smaller, with the elimination of older code and a significant reduction in duplicate code. Changes fall in three folders:
replaced by work in e-mission#102
work now in e-mission#102
these are the scripts that I got from Cemal, and what later work on the paper visualizations in e-mission#102 is based on
this way expected outputs can be viewed, separate from the code itself
As I am going through the charts in the paper to polish them up, I am also taking the time to organize, document, and check in the code used to produce those results. This maintains transparency for future researchers who might want to reproduce our results.
Added a DataFiltering Notebook:
Analysis Notebooks: planning for one with non-spatial data and one to work with spatial data; will update as I make these changes, as the plan may change depending on the data formats.
Another note: much of this code is coming from a previous researcher who worked on the paper, Cemal Akcicek. My work is focused on organizing and polishing.