
Refactor standard deviation calculation #147

Open · wants to merge 6 commits into main

Conversation

@wagnerlmichael (Member) commented Feb 6, 2025

This PR showcases a potential refactor that reduces lines of code and may improve readability, at the possible cost of modularity and runtime. I've confirmed that this branch flags sales in exactly the same way as master by comparing outputs.
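For illustration, an equivalence check along these lines would confirm identical flagging; the column and key names here are hypothetical, not the project's actual schema:

```python
import pandas as pd

# Hypothetical schema: a sale ID plus the outlier flag produced by each branch
main_out = pd.DataFrame({"sale_id": [1, 2, 3], "sv_is_outlier": [True, False, True]})
branch_out = pd.DataFrame({"sale_id": [1, 2, 3], "sv_is_outlier": [True, False, True]})

# Join on the sale ID and look for any rows where the two flags disagree
merged = main_out.merge(branch_out, on="sale_id", suffixes=("_main", "_branch"))
mismatches = merged[merged["sv_is_outlier_main"] != merged["sv_is_outlier_branch"]]
assert mismatches.empty  # identical flagging across branches
```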

Previously, four functions were primarily responsible for the standard deviation calculations:

  • pricing_info
  • price_column
  • which_price
  • get_thresh

The standard deviation information was held in a nested dictionary structure operated on by the get_thresh helper function. In this PR we switch to vectorized operations and remove the nested-dictionary strategy, including get_thresh.
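In outline, the vectorized approach relies on pandas' `groupby(...).transform(...)`, which broadcasts a group statistic back to every row, so per-row thresholds can be computed without a nested dictionary lookup. A minimal sketch with toy data (the column names, group keys, and multipliers are illustrative, not the pipeline's real schema):

```python
import pandas as pd

# Toy data standing in for sales; "township" and "price" are illustrative names
df = pd.DataFrame(
    {
        "township": ["A", "A", "A", "B", "B", "B"],
        "price": [100.0, 200.0, 300.0, 10.0, 20.0, 90.0],
    }
)
groups = ["township"]
permut = (1.0, 1.0)  # (lower, upper) std-dev multipliers, assumed

# transform("mean") / transform("std") return per-row Series aligned with df
group_mean = df.groupby(groups)["price"].transform("mean")
group_std = df.groupby(groups)["price"].transform("std")
df["price_lower"] = group_mean - permut[0] * group_std
df["price_upper"] = group_mean + permut[1] * group_std
df["price_outlier"] = ~df["price"].between(df["price_lower"], df["price_upper"])
```

With one standard deviation on each side, only the 90.0 sale in group B falls outside its group's band.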

Edit:
After testing on the same subset of data, the proposed change decreases the runtime of the overarching pricing_info function:

| Spec | Runtime |
| --- | --- |
| Main | 315.28s |
| PR branch | 255.78s |

@wagnerlmichael wagnerlmichael marked this pull request as ready for review February 21, 2025 21:21
@jeancochrane (Collaborator) left a comment


I like this refactor! Can you share the query you ran to check that these results are identical to the existing results? Once I've verified that, we'll be good to go.

Comment on lines +229 to +244
```python
# Vectorized per-row lower and upper thresholds (mean ± std * multiplier)
for col in [f"sv_price_deviation_{group_str}", f"sv_cgdr_deviation_{group_str}"]:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[
        1
    ] * df.groupby(list(groups))[col].transform("std")
if not condos:
    df["sv_which_price"] = df.apply(which_price, args=(holds, groups), axis=1)

    col = f"sv_price_per_sqft_deviation_{group_str}"
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[
        1
    ] * df.groupby(list(groups))[col].transform("std")
```

[Suggestion, non-blocking] Nice generalization! We could go one step further and fold the if not condos branch into the for loop that precedes it:

```python
thresh_cols = [
    f"sv_price_deviation_{group_str}",
    f"sv_cgdr_deviation_{group_str}",
] + ([] if condos else [f"sv_price_per_sqft_deviation_{group_str}"])

for col in thresh_cols:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    ...
```

```python
sqft_val = row[f"sv_price_per_sqft_deviation_{group_str}"]
sqft_lower = row[f"sv_price_per_sqft_deviation_{group_str}_lower"]
sqft_upper = row[f"sv_price_per_sqft_deviation_{group_str}_upper"]
sqft_out = sqft_val > sqft_upper or sqft_val < sqft_lower
```

[Nitpick, non-blocking] Any reason not to use between_two_numbers() here, the way we do for the rest of the threshold checks?
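For reference, the inverted check could look like the sketch below; the strict-inequality definition of `between_two_numbers` here is a stand-in assumption, and the project's actual helper may handle the boundaries differently:

```python
# Hypothetical stand-in for the codebase's between_two_numbers() helper;
# the real signature and boundary handling may differ.
def between_two_numbers(num: float, low: float, high: float) -> bool:
    return low < num < high

# Assumed example values for a sale's price-per-sqft deviation and its band
sqft_val, sqft_lower, sqft_upper = 120.0, 50.0, 100.0

# Equivalent to: sqft_val > sqft_upper or sqft_val < sqft_lower
# (modulo boundary behavior)
sqft_out = not between_two_numbers(sqft_val, sqft_lower, sqft_upper)
```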
