Naive vs CHEER analysis measurement #168
At this point, the naive calculations have been successfully retrieved from the history, and we have the columns in a data frame that represent them. We are making good progress towards the CHEER calculations by utilizing modules from e-mission-common. We are dropping rows from the smart commute data that fall under the criteria below; since we are interested in comparing carbon calculations, non-trips are not useful.

```python
expanded_ct = expanded_ct.dropna(subset=['data_user_input_mode_confirm'])
expanded_ct = expanded_ct[expanded_ct['Mode_confirm'] != 'Not a Trip']
```
We have working functionality to compare the naive and CHEER calculations by outputting a graph of average CO2 emissions for each mode, using the smart commute data. Now that we have established this functionality, the intention is to move on to bigger data sets. I have already started the process of getting Bull eBike (Durham, NC) data access by communicating with the TSDC. Furthermore, we can already begin analysis of the whole CanBikeCO data set instead of just one part (smart commute). Finally, the main objective is to compare not only across locations, but also across years.
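For reference, a chart like the one described could be produced with a small pandas/matplotlib sketch along these lines; the CO2 column names ('naive_co2_kg', 'cheer_co2_kg') are placeholders, not the actual notebook columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Average CO2 per trip for each confirmed mode, under both methods.
# expanded_ct is the filtered smart commute dataframe from above.
avg_by_mode = (expanded_ct
               .groupby('Mode_confirm')[['naive_co2_kg', 'cheer_co2_kg']]
               .mean())

avg_by_mode.plot(kind='bar', figsize=(10, 5))
plt.ylabel('Average kg CO2 per trip')
plt.title('Naive vs CHEER average emissions by confirmed mode')
plt.tight_layout()
plt.show()
```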
Rough timeline
Great question; that has not been implemented yet, but I'm sure we can. Since the values are documented in the paper, it shouldn't be too hard to multiply the value by the distance of the trip, and it makes a lot of sense to show all three methods. @jpfleischer I don't think we really have the code for this like we did for the before-and-after on the public dashboard. I would look at the methods in the paper and do something like: hardcode the mode-value lookup, develop a function that looks up the value given the mode and multiplies it by the distance, and apply that to the dataframe (see the sketch below).
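A minimal sketch of that approach, assuming a hypothetical trips dataframe with a confirmed mode column and a distance column in kilometers; the coefficient values here are placeholders rather than the numbers from the paper:

```python
# Hypothetical g CO2 per km for each labeled mode -- replace with the
# coefficients documented in the paper.
MODE_CO2_G_PER_KM = {
    'drove_alone': 250.0,
    'bus': 160.0,
    'walk': 0.0,
}

def naive_co2_kg(mode, distance_km):
    """Look up the coefficient for a mode and scale it by trip distance."""
    return MODE_CO2_G_PER_KM.get(mode, 0.0) * distance_km / 1000.0

trips['naive_co2_kg'] = trips.apply(
    lambda row: naive_co2_kg(row['Mode_confirm'], row['distance_km']), axis=1)
```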
I see what you're saying here, and this makes sense to me. I have been focused on wanting to capture "impact of adapting over time" and "impact of adapting by geography", but you're right that we don't have to compare geographies to do that, since by comparing naive and CHEER we are already showing the difference between "using Denver values" and "using local values".
Given some of this feedback, I think I would propose a few things:
|
The original coefficients are also documented in https://github.com/e-mission/e-mission-server/blob/b15fcb983c6b2f40e548f53550d417829a2f08fc/front/server/carbon_calc_details.html#L62 I think it should be as simple as using the new coefficients in the csv, and making sure to use the |
Does MassCEC not have any bus/train trips? |
I would add one regular and one log-scale chart; the log-scale one is just to show the e-bike results. You could put the e-bike results from the three programs into one chart that is log scale.
We can see the change in CHEER between the years in Colorado! |
Here is a preliminary plot for the naive-naive ("naive squared") method for CanBikeCO. The only values that were sensed were: {0: "unknown", 1: "walking", 2: "bicycling", …
For the oldest method we're trying to evaluate: the data we currently have is "cleaned sections", which have a "sensed mode" that is one of the "MotionTypes"; what we need is the "inferred sections", which have one of the "PredictedModeTypes".
But that still doesn't explain why some of the data has a "sensed mode" that isn't an integer between 0 and 11. That's just a few sections (20 or so out of thousands) |
For the cumulative, maybe it would be easier to compare if we had the % difference between Naive and CHEER? As a table, that could be something like:
Or add it to the chart in some way, maybe to the top of the CHEER bar to clearly indicate it's offset from the Naive bar? |
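A quick way to compute that % difference, assuming a summary dataframe with one row per program and cumulative totals; the column names here are illustrative:

```python
# Percent difference of CHEER relative to the Naive cumulative total.
summary['pct_difference'] = (
    (summary['cheer_total_kg'] - summary['naive_total_kg'])
    / summary['naive_total_kg'] * 100
)
print(summary[['naive_total_kg', 'cheer_total_kg', 'pct_difference']])
```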
I have the inferred CSVs now. The sensed modes in those data are:
Sanity check of average speeds:
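A sanity check like that could be computed roughly as follows; the section column names ('data_distance' in meters, 'data_duration' in seconds, 'data_sensed_mode') are assumptions about the export format:

```python
# Average speed per sensed mode, in MPH, assuming distance in meters
# and duration in seconds.
sections['speed_mph'] = (
    sections['data_distance'] / sections['data_duration'] * 2.23694
)
print(sections.groupby('data_sensed_mode')['speed_mph'].mean())
```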
Is it ok to make these assumptions about the correspondence between the paper values and the sensed modes?

```python
g_pkm = {
    'Car': 172.78,         # ICEV
    'Train': 57.17,
    'Subway': 57.17,       # Treat Subway as Train
    'Bus': 165.94,
    'Air_Or_Hsr': 134.86,  # is it ok to group HSR in too?
    'Walking': 0,
    'Bicycling': 0,
}
```
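As a usage note, with those values (g CO2 per passenger-km) a section's emissions would come out roughly as below; the distance units (meters) are an assumption:

```python
def section_co2_kg(mode_label, distance_m):
    # g/pkm * km gives grams of CO2; convert to kg. Unmapped modes fall back to 0.
    return g_pkm.get(mode_label, 0) * (distance_m / 1000) / 1000

print(section_co2_kg('Bus', 5000))  # 5 km bus ride -> ~0.83 kg CO2
```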
If the mapping in the paper was confusing, you can view the original mapping (where I got what I put in the paper) here: https://github.com/e-mission/e-mission-server/blob/b15fcb983c6b2f40e548f53550d417829a2f08fc/front/server/carbon_calc_details.html#L69 What you put in the comment looks right to me though |
I think that this is a good point to start from. Where is the CanBikeCO? You would then want to dig deeper into this and explain why CHEER was lower for MassCEC and higher for Durham (maybe now you can split by mode, and then further split transit by fleet and occupancy) |
As a sanity check for each dataset, we take its inferred sections and its confirmed trips and sum up the total distance covered for both. That way, for the (App, 2014) (naive-naive) method, which uses inferred sections, we can make sure that we are only using inferred sections that are attached to a trip, for fairness. However, CanBikeCO's subsets had one dataset whose section distance and trip distance numbers were not congruent: 4c had 46,713,000 for sections and 46,518,000 for trips. Does anyone know why CC has such a disparity?
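For reference, that sanity check could look roughly like the sketch below; the column names are assumptions about the export format, not the actual ones in the notebook:

```python
# Compare total distance covered by inferred sections (restricted to sections
# that belong to a confirmed trip) against total distance of confirmed trips.
trip_ids = set(confirmed_trips['data_cleaned_trip'])
matched_sections = inferred_sections[
    inferred_sections['data_trip_id'].isin(trip_ids)
]

section_total_m = matched_sections['data_distance'].sum()
trip_total_m = confirmed_trips['data_distance'].sum()
print(f"sections: {section_total_m:,.0f} m, trips: {trip_total_m:,.0f} m")
```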
I would have to look at the data to know; do you think there's something different about the ids? Is the dataset shorter than expected? Maybe compare the number of rows between datasets? You'd need to use the total trips, not just the labeled ones, but if you get an idea of the ratio maybe you can tell whether CC comes up short and there might be missing data.
For some reason there are many trips without sections in CC.

```python
# current_section is a dataframe of the inferred sections.
# full_csv is a dataframe of the confirmed trips.
trip_id_column = 'data_trip_id' if 'data_trip_id' in current_section.columns else 'tripno'

# All confirmed trip ids, and the subset of them that appear in at least one section.
all_trip_ids = set(full_csv["data_cleaned_trip"].unique())
matched_trip_ids = set(
    current_section[current_section[trip_id_column].isin(all_trip_ids)][trip_id_column].unique()
)

trips_without_sections = all_trip_ids - matched_trip_ids
num_trips_without_sections = len(trips_without_sections)

print(f"Total trips in full_csv: {len(all_trip_ids)}")
print(f"Trips in full_csv without any sections: {num_trips_without_sections}")
```

Output: `Total trips in full_csv: 75154`. This is very peculiar, as the final number here is usually 0, 2, or 3 for the other sets. I will be re-requesting the CC data.
The problem goes away when I use the cleaned sections instead of the inferred sections... I am continuing to use the abby_ceo folder for confirmed_trips. When I switch to using the inferred_sections for distance comparison, only CC/Boulder has a big discrepancy. When I switch to using cleaned_sections for distance comparison, only FC/Durango has a big discrepancy. |
If you have Do both |
You are right about the sensed mode. I can only use inferred, because only that one has the right values for sensed mode. I am going to request the entire dataset with all of the types of CSVs.
Yes, I think you should summarize to one metric per user and then display for all users. 500 points is not too much to show in a chart as long as you are looking for patterns and not details. |
Yes, I would sort low to high and remove the x tick labels.
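Putting those suggestions together, a per-user chart could look roughly like this; the per-user metric ('pct_difference') and column names are placeholders:

```python
import matplotlib.pyplot as plt

# One value per user, sorted low to high, with x tick labels hidden.
per_user = (trips.groupby('user_id')['pct_difference']
            .mean()
            .sort_values())

ax = per_user.plot(kind='bar', figsize=(12, 4))
ax.set_xticklabels([])  # individual user ids are not informative
ax.set_ylabel('% difference, Naive vs CHEER')
plt.tight_layout()
plt.show()
```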
Is that a 500% change? Some of those seem awfully high; I think we will need a better explanation of the > 100% change entries. Also, if there are so many users with over 100% difference, why is the overall difference so low (I can think of many reasons, but we should figure it out and document it). |
What is the energy and emissions intensity for "Other" in CHEER? That doesn't seem right that it would be that much higher than the bus users. |
A mismatch is created because Dashboard 2020 (naive) is using … This was accounted for previously in the figures that we created by dropping Air. Detail in the paper: the Dashboard considers these air values as 0 because there is no mapping for air in mode_labels, whereas CHEER considers them as having an actual value. Theorized solution: …
So it does look like most of the discrepancy was for "Other"/"air" trips, which is what we thought was the case. I think this is an interesting figure, and we could include it in the revision/polishing of the paper! It highlights the impact of using NTD data for calculations in a very visible way. In a much smaller way, we see the few negative bars, which appear to be e-bike colored. I would suggest dropping modes that don't appear (namely the zero-emission modes), which might make it even easier to read what mode corresponds to what color.
I wanted to note the cumulative emissions, which checks out in Figure 5.
|
Q: How many trips are over 100 MPH? Then how many of those are not in [air, train]? (do it for each program.)
|
What are the modes of the 48 trips in CanBikeCO? Can you do something like groupby mode and count? It's odd that the number is so high.
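Something along these lines would answer both questions; the speed and mode column names are assumptions about the trips dataframe:

```python
# Trips over 100 MPH, excluding air and train, counted per mode.
MPS_TO_MPH = 2.23694
trips['speed_mph'] = trips['data_distance'] / trips['data_duration'] * MPS_TO_MPH

fast = trips[(trips['speed_mph'] > 100) &
             (~trips['mode'].isin(['air', 'train']))]
print(fast.groupby('mode').size().sort_values(ascending=False))
```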
Counts of trips over 100 MPH (excluding 'air' and 'train') in CanBikeCO, grouped by mode:
Also, the UACE issue is not an issue with the coordinates; rather, there are no agencies that publish data for that UACE, causing the UACE to become None. Even though so much of the dataset is in 09298 (Boulder), only one demand response (DR) transit agency publishes fuel data, and no buses.
Ah, that makes lots of sense. So "occurred inside a UACE" vs "occurred inside a UACE where at least one agency published data for the requested mode" |
The buses in Boulder are run by the RTD (https://bouldercolorado.gov/services/bus). So it seems like this is a limitation of the NTD; it assumes that every transit agency only operates in one UACE. BART is listed with a UACE of 78904, for example, which is San Francisco--Oakland, CA, but it runs up to Dublin/Pleasanton, which is in the Livermore-Dublin-Pleasanton (50533) UACE, and Antioch, which is in the 02683 UACE. @JGreenlee does CHEER support multi-UACE agencies now? If so, how?
From my perspective, CHEER does not support multi-UACE agencies because the NTD only has one UACE per agency.
The only issue would be matching the names of agencies in the NTD to the names of agencies in OpenStreetMap.
That is definitely more compact! I think it might be useful to shorten the title of the NC program. I also wonder if this is the most useful version of this chart because CanBikeCO is so much higher than the others (I imagine that mostly has to do with longer collection period and/or more users). Depending on the purpose of the chart (how it is being used to support an argument in the paper) maybe it would be useful to display it differently. |
@Abby-Wheelis correct. Again, our focus is not on the carbon footprint, but on the difference between the baseline and the improved calculation method. I would suggest changing the chart to reflect that. |
Why are we using Wh/pkm here? pkm is passenger-km, so this is effectively "the coefficient". I thought that the point of having the two maps was to show the two components of "the coefficient": the ridership and the fuel efficiency. So we don't want to double-count the people, right? Why isn't the fuel efficiency in Wh/km, or Wh/vkm to be even more explicit?
Good start! I think the CHEER method is incomplete here - where is the energy use? |
I didn't think about it this way, but you're right, it would be better to not double count the people and to show the two components of the coefficient separately |
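For reference, the relationship being discussed, as a sketch with made-up numbers (not values from the NTD):

```python
# Per-vehicle fuel efficiency and ridership combine into the per-passenger
# coefficient, so showing both maps should not double-count occupancy.
wh_per_vkm = 2500.0    # illustrative energy use per vehicle-km
avg_occupancy = 10.0   # illustrative passengers per vehicle

wh_per_pkm = wh_per_vkm / avg_occupancy  # 250 Wh per passenger-km
print(wh_per_pkm)
```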
This is also not a correct representation of the Dashboard (2020) approach, which did use electrified fuels for the train and the e-bike.
The intention is to get quantified measurements and visualizations that compare naive method and CHEER method.
The naive method is in historical commits pre-#152
and #152 brings in the CHEER calculation by using the e-mission-common module.
So using smart commute, we will read in a data frame and then use the functions to calculate the footprint in both ways and get the difference.
@Abby-Wheelis
EDIT: The notebook I wrote to accomplish the solution to this issue is available at #180