
Refactor standard deviation calculation #147

Open · wants to merge 6 commits into main

Conversation

@wagnerlmichael (Member) commented Feb 6, 2025

This PR showcases a potential refactor that reduces lines of code and may improve readability, at the possible cost of modularity and runtime. I've confirmed that this branch flags sales in exactly the same way as master by comparing outputs.
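For illustration, an equivalence check along these lines would confirm identical flagging; the column and key names here are hypothetical, not the project's actual schema:

```python
import pandas as pd

# Hypothetical schema: a sale ID plus the outlier flag produced by each branch
main_out = pd.DataFrame({"sale_id": [1, 2, 3], "sv_is_outlier": [True, False, True]})
branch_out = pd.DataFrame({"sale_id": [1, 2, 3], "sv_is_outlier": [True, False, True]})

# Join on the sale ID and look for any rows where the two flags disagree
merged = main_out.merge(branch_out, on="sale_id", suffixes=("_main", "_branch"))
mismatches = merged[merged["sv_is_outlier_main"] != merged["sv_is_outlier_branch"]]
assert mismatches.empty  # identical flagging across branches
```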

Previously, four functions were primarily responsible for the standard deviation calculations:

  • pricing_info
  • price_column
  • which_price
  • get_thresh

The standard deviation information was held in a nested dictionary structure operated on by the get_thresh helper function. In this PR we switch to vectorized operations and remove the nested-dictionary strategy, including get_thresh.
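In outline, the vectorized approach relies on pandas' `groupby(...).transform(...)`, which broadcasts a group statistic back to every row, so per-row thresholds can be computed without a nested dictionary lookup. A minimal sketch with toy data (the column names, group keys, and multipliers are illustrative, not the pipeline's real schema):

```python
import pandas as pd

# Toy data standing in for sales; "township" and "price" are illustrative names
df = pd.DataFrame(
    {
        "township": ["A", "A", "A", "B", "B", "B"],
        "price": [100.0, 200.0, 300.0, 10.0, 20.0, 90.0],
    }
)
groups = ["township"]
permut = (1.0, 1.0)  # (lower, upper) std-dev multipliers, assumed

# transform("mean") / transform("std") return per-row Series aligned with df
group_mean = df.groupby(groups)["price"].transform("mean")
group_std = df.groupby(groups)["price"].transform("std")
df["price_lower"] = group_mean - permut[0] * group_std
df["price_upper"] = group_mean + permut[1] * group_std
df["price_outlier"] = ~df["price"].between(df["price_lower"], df["price_upper"])
```

With one standard deviation on each side, only the 90.0 sale in group B falls outside its group's band.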

Edit:
After testing on the same subset of data, the proposed change decreases the runtime of the overarching pricing_info function:

| Spec | Runtime |
| --- | --- |
| Main | 315.28s |
| PR branch | 255.78s |

@wagnerlmichael wagnerlmichael marked this pull request as ready for review February 21, 2025 21:21
@jeancochrane (Collaborator) left a comment


I like this refactor! Can you share the query you ran to check that these results are identical to the existing results? Once I've verified that, we'll be good to go.

Comment on lines +229 to +244
```python
# Vectorized per-row lower and upper thresholds (mean ± std * multiplier)
for col in [f"sv_price_deviation_{group_str}", f"sv_cgdr_deviation_{group_str}"]:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[
        1
    ] * df.groupby(list(groups))[col].transform("std")
if not condos:
    df["sv_which_price"] = df.apply(which_price, args=(holds, groups), axis=1)

    col = f"sv_price_per_sqft_deviation_{group_str}"
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[
        1
    ] * df.groupby(list(groups))[col].transform("std")
```

[Suggestion, non-blocking] Nice generalization! We could go one step further and fold the if not condos branch into the for loop that precedes it:

```python
thresh_cols = [
    f"sv_price_deviation_{group_str}",
    f"sv_cgdr_deviation_{group_str}",
] + ([] if condos else [f"sv_price_per_sqft_deviation_{group_str}"])

for col in thresh_cols:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    ...
```

```python
sqft_val = row[f"sv_price_per_sqft_deviation_{group_str}"]
sqft_lower = row[f"sv_price_per_sqft_deviation_{group_str}_lower"]
sqft_upper = row[f"sv_price_per_sqft_deviation_{group_str}_upper"]
sqft_out = sqft_val > sqft_upper or sqft_val < sqft_lower
```

[Nitpick, non-blocking] Any reason not to use between_two_numbers() here, the way we do for the rest of the threshold checks?
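For reference, the inverted check could look like the sketch below; the strict-inequality definition of `between_two_numbers` here is a stand-in assumption, and the project's actual helper may handle the boundaries differently:

```python
# Hypothetical stand-in for the codebase's between_two_numbers() helper;
# the real signature and boundary handling may differ.
def between_two_numbers(num: float, low: float, high: float) -> bool:
    return low < num < high

# Assumed example values for a sale's price-per-sqft deviation and its band
sqft_val, sqft_lower, sqft_upper = 120.0, 50.0, 100.0

# Equivalent to: sqft_val > sqft_upper or sqft_val < sqft_lower
# (modulo boundary behavior)
sqft_out = not between_two_numbers(sqft_val, sqft_lower, sqft_upper)
```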
