Refactor standard deviation calculation #147
base: main
Conversation
I like this refactor! Can you share the query you ran to check that these results are identical to the existing results? Once I've verified that, we'll be good to go.
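One hedged way to answer the reviewer's question is to join the flag output of both branches on a key and count disagreements. The column and frame names below are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

# Toy stand-ins for the outputs of the two branches; in practice these
# would be loaded from each branch's run over the same input data.
old = pd.DataFrame({"sale_id": [1, 2, 3], "sv_is_outlier": [True, False, True]})
new = pd.DataFrame({"sale_id": [1, 2, 3], "sv_is_outlier": [True, False, True]})

# Align the two runs on the sale key and count rows where the flags differ.
merged = old.merge(new, on="sale_id", suffixes=("_old", "_new"))
mismatches = merged[merged["sv_is_outlier_old"] != merged["sv_is_outlier_new"]]
print(len(mismatches))  # 0 when the two branches agree
```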
# Vectorized per-row lower and upper thresholds (mean ± std * multiplier)
for col in [f"sv_price_deviation_{group_str}", f"sv_cgdr_deviation_{group_str}"]:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[
        1
    ] * df.groupby(list(groups))[col].transform("std")
if not condos:
    df["sv_which_price"] = df.apply(which_price, args=(holds, groups), axis=1)

    col = f"sv_price_per_sqft_deviation_{group_str}"
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[
        1
    ] * df.groupby(list(groups))[col].transform("std")
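The pattern in this hunk can be reduced to a minimal, self-contained sketch. The grouping key, column name, and multipliers below are illustrative stand-ins, not the project's actual schema:

```python
import pandas as pd

# Toy data standing in for sales records.
df = pd.DataFrame({
    "township": ["A", "A", "A", "B", "B", "B"],
    "sv_price_deviation": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
permut = (2.0, 2.0)  # assumed lower/upper std-deviation multipliers

# transform() broadcasts each group's mean and std back to every row,
# so the thresholds are computed without any per-row Python loop.
grouped = df.groupby("township")["sv_price_deviation"]
df["sv_price_deviation_lower"] = grouped.transform("mean") - permut[0] * grouped.transform("std")
df["sv_price_deviation_upper"] = grouped.transform("mean") + permut[1] * grouped.transform("std")

print(df["sv_price_deviation_upper"].tolist())
# [4.0, 4.0, 4.0, 40.0, 40.0, 40.0]  (group A: mean 2, std 1; group B: mean 20, std 10)
```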
[Suggestion, non-blocking] Nice generalization! We could go one step further and fold the `if not condos` branch into the `for` loop that precedes it:

thresh_cols = [
    f"sv_price_deviation_{group_str}",
    f"sv_cgdr_deviation_{group_str}",
] + ([] if condos else [f"sv_price_per_sqft_deviation_{group_str}"])
for col in thresh_cols:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    ...
sqft_val = row[f"sv_price_per_sqft_deviation_{group_str}"]
sqft_lower = row[f"sv_price_per_sqft_deviation_{group_str}_lower"]
sqft_upper = row[f"sv_price_per_sqft_deviation_{group_str}_upper"]
sqft_out = sqft_val > sqft_upper or sqft_val < sqft_lower
[Nitpick, non-blocking] Any reason not to use `between_two_numbers()` here, the way we do for the rest of the threshold checks?
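For readers outside the repo: a helper like the one the reviewer references presumably just expresses the range check in one call. This is a hypothetical sketch, and the real function's signature and bounds semantics in this codebase may differ:

```python
# Hypothetical sketch of a between_two_numbers()-style helper; the actual
# helper in this repository may use inclusive bounds or a different signature.
def between_two_numbers(num, lower, upper):
    """Return True when num lies strictly between lower and upper."""
    return lower < num < upper

# The sqft check above could then read:
#   sqft_out = not between_two_numbers(sqft_val, sqft_lower, sqft_upper)
print(between_two_numbers(5, 1, 10))  # True
print(between_two_numbers(11, 1, 10))  # False
```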
This PR showcases a potential refactor that reduces lines of code and may improve readability. Potential costs of this refactor are modularity and computational cost (runtime). I've confirmed that this branch flags sales in exactly the same way as master by comparing outputs from both.
Previously, four functions were primarily responsible for the standard deviation calculations:

- pricing_info
- price_column
- which_price
- get_thresh
The standard deviation information had been held in a nested dictionary structure operated on with the get_thresh helper function. In this PR we switch to vectorized operations and remove the nested dictionary strategy, including get_thresh.

Edit:
After testing on the same subset of data, the proposed change appears to decrease runtime for the overarching pricing_info function.
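The runtime claim is consistent with how pandas behaves in general: a grouped transform runs in vectorized code, while per-row lookups go through the Python interpreter. A rough, self-contained way to see this (sizes, seed, and column names below are invented, not the project's data):

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.integers(0, 100, 20_000),
    "val": rng.normal(size=20_000),
})

# Vectorized per-group mean, as in this PR.
t0 = time.perf_counter()
vec = df.groupby("group")["val"].transform("mean")
t_vec = time.perf_counter() - t0

# Row-wise lookup, loosely analogous to the old per-row dictionary strategy.
means = df.groupby("group")["val"].mean().to_dict()
t0 = time.perf_counter()
rowwise = df.apply(lambda row: means[row["group"]], axis=1)
t_rowwise = time.perf_counter() - t0

# Both approaches agree on the values; the transform is typically much faster.
print(bool(np.allclose(vec, rowwise)))  # True
```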