Fix for model-build failure due to presence of survey inputs as a dictionary (e-mission#954)

* Filtered out rows with dictionary in user_label_df

  The idea is to check the data type using isinstance() and then apply this check on the entire data frame as a whole instead of doing it iteratively on each row, which is much slower. These rows are then filtered out of the original dataframe, leaving behind only the non-dict rows.

* Added log statements to indicate dataframe filtering done

  Added log statement to the greedy_similarity_binning, to indicate filtering is being done for the dictionary elements in the dataframe if the column is 'trip_user_input'.

* Modified filtering of survey inputs

  Now filtering survey inputs before dataframe itself is created by checking whether each dictionary value is a value or a nested dictionary.

* Add TODO so I can merge this

---------

Co-authored-by: Mahadik, Mukul Chandrakant <[email protected]>
Co-authored-by: K. Shankari <[email protected]>
MukuFlash03 · Feb 10, 2024 · 174bfb1
1 parent 911e1ec
commit 174bfb1
Showing 1 changed file with 5 additions and 1 deletion.
diff --git a/emission/analysis/modelling/trip_model/greedy_similarity_binning.py b/emission/analysis/modelling/trip_model/greedy_similarity_binning.py
@@ -288,7 +288,11 @@ def _generate_predictions(self):
         probability is estimated with label_count / total_labels.
         """
         for _, bin_record in self.bins.items():
-            user_label_df = pd.DataFrame(bin_record['labels'])
+            # TODO: Revisit after we have unified label and survey inputs (https://github.com/e-mission/e-mission-docs/issues/1045)
+            logging.debug("Filtering out any nested dictionaries from the list of dictionary labels")
+            filtered_label_dicts = [label_dict for label_dict in bin_record['labels'] if not any(isinstance(x, dict) for x in label_dict.values())]            
+            logging.debug("Number of entries after filtering changed %s -> %s" % (len(bin_record['labels']), len(filtered_label_dicts)))
+            user_label_df = pd.DataFrame(filtered_label_dicts)
             user_label_df = lp.map_labels(user_label_df).dropna()
             # compute the sum of trips in this cluster
             sum_trips = len(user_label_df)
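
The filtering step added in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: the helper name `drop_nested_label_dicts` and the sample entries are hypothetical, and the shape of a survey-style entry (a nested dictionary under `trip_user_input`) is an assumption based on the commit message. It only illustrates that any label dictionary containing a nested dictionary value is dropped before a DataFrame would be built.

```python
import logging


def drop_nested_label_dicts(label_dicts):
    """Keep only label dicts whose values are all non-dict values.

    Hypothetical helper that mirrors the list comprehension in the diff.
    """
    filtered = [
        label_dict
        for label_dict in label_dicts
        if not any(isinstance(value, dict) for value in label_dict.values())
    ]
    logging.debug(
        "Number of entries after filtering changed %s -> %s",
        len(label_dicts),
        len(filtered),
    )
    return filtered


# Hypothetical sample data: two flat label entries and one survey-style
# entry whose value is a nested dictionary (assumed shape).
sample_labels = [
    {"mode_confirm": "bike", "purpose_confirm": "work"},
    {"trip_user_input": {"data": {"label": "survey response"}}},
    {"mode_confirm": "walk", "purpose_confirm": "shopping"},
]

flat_labels = drop_nested_label_dicts(sample_labels)
print(flat_labels)
```

Passing `flat_labels` to `pd.DataFrame(...)` would then give a frame with only scalar label columns, which appears to be what the downstream `lp.map_labels(...)` call in the diff expects.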