Refactor process parquet #124
base: croptype
Conversation
Added some comments after the desk discussion yesterday.
presto/utils.py (Outdated)
takes into account updated start_date and end_date; available_timesteps
holds the absolute number of timesteps that for which observations are
- computing the number of available timesteps in the timeseries;
  it represents the absolute number of timesteps for which observations are
  available; it cannot be less than NUM_TIMESTEPS; if this is the case,
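A minimal sketch of the available_timesteps notion described in this docstring, assuming a DataFrame with sample_id and timestamp columns; the value 12 is a placeholder for the NUM_TIMESTEPS constant imported from presto-worldcereal:

```python
import pandas as pd

NUM_TIMESTEPS = 12  # placeholder; in presto/utils.py this constant is imported

def available_timesteps_per_sample(df: pd.DataFrame) -> pd.Series:
    """Absolute number of distinct observation timestamps per sample_id."""
    counts = df.groupby("sample_id")["timestamp"].nunique()
    too_short = counts[counts < NUM_TIMESTEPS]
    if not too_short.empty:
        # the docstring above states this case is not allowed
        raise ValueError(
            f"Samples with fewer than {NUM_TIMESTEPS} timesteps: {list(too_short.index)}"
        )
    return counts
```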
How is this now more flexible for the number of timesteps if the latter is still imported from presto-worldcereal? It doesn't seem to be a flexible parameter.
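To illustrate the point about flexibility, a hypothetical signature where the number of timesteps is an argument defaulting to the imported constant rather than being fixed by the import; this is not the code in the PR:

```python
import pandas as pd

NUM_TIMESTEPS = 12  # stand-in for the constant imported from presto-worldcereal

def process_parquet(df: pd.DataFrame, num_timesteps: int = NUM_TIMESTEPS) -> pd.DataFrame:
    # downstream checks would read num_timesteps instead of the module-level
    # constant, so callers could pass e.g. 36 for dekadal data
    ...
    return df
```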
presto/utils.py (Outdated)
)
index_columns.append("available_timesteps")

# check for missing timestamps in the middle of timeseries
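A hedged sketch of what a check for missing timestamps in the middle of a timeseries could look like; the monthly frequency and column names are assumptions, not taken from the PR:

```python
import pandas as pd

def has_internal_gaps(sample: pd.DataFrame, freq: str = "MS") -> bool:
    """True if any expected timestamp between the first and last observation is missing."""
    observed = pd.DatetimeIndex(sample["timestamp"].unique()).sort_values()
    expected = pd.date_range(observed.min(), observed.max(), freq=freq)
    return len(expected.difference(observed)) > 0
```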
Does this comment belong to the actual code on the next line (being the concatenation of dataframes)?
This part is now removed: b4964b0
@@ -314,6 +301,7 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
    df = pd.concat([df, dummy_df])

    # finally pivot the dataframe
    index_columns = list(np.unique(index_columns))
    df_pivot = df.pivot(index=index_columns, columns="timestamp_ind", values=feature_columns)
    df_pivot = df_pivot.fillna(NODATAVALUE)
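A minimal reproduction of the pivot-and-fill step in this hunk, with toy data; the NODATAVALUE of 65535 is an assumption (the real sentinel comes from the presto-worldcereal constants):

```python
import pandas as pd

NODATAVALUE = 65535  # assumed sentinel value

df = pd.DataFrame({
    "sample_id": ["a", "a", "b"],
    "timestamp_ind": [0, 1, 0],  # sample "b" has no observation at index 1
    "NDVI": [0.2, 0.4, 0.3],
})
df_pivot = df.pivot(index="sample_id", columns="timestamp_ind", values="NDVI")
df_pivot = df_pivot.fillna(NODATAVALUE)  # the missing (b, 1) cell becomes NODATAVALUE
print(df_pivot)
```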
Is this the actual filling of timestamps "in the middle"?
This part is now actually obsolete, because the ranking function used to create timestamp_ind does not leave any gaps that might require filling in. I removed it for now: b4964b0
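A small sketch of why a dense rank per sample leaves no gaps to fill: the resulting timestamp_ind values are consecutive integers starting at 0 even when calendar months are skipped. The grouping key and rank method here are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "sample_id": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime(
        ["2021-01-01", "2021-03-01", "2021-04-01", "2021-01-01", "2021-02-01"]
    ),
})
df["timestamp_ind"] = (
    df.groupby("sample_id")["timestamp"].rank(method="dense").astype(int) - 1
)
# sample "a" skips February, yet its indices are 0, 1, 2 with no gap,
# so no "middle" filling is needed before the pivot
print(df)
```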
@@ -716,7 +696,7 @@ def prep_dataframe(
    # SAR cannot equal 0.0 since we take the log of it
    cols = [f"SAR-{s}-ts{t}-20m" for s in ["VV", "VH"] for t in range(36 if dekadal else 12)]

-   df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "end_date"])
+   df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "start_date"])
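A toy illustration of the mechanical effect of this change: deduplicating on start_date instead of end_date keeps rows that share an end date but differ in start date. The data values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "sample_id": ["s1", "s1"],
    "lat": [50.0, 50.0],
    "lon": [4.0, 4.0],
    "start_date": ["2021-01-01", "2021-04-01"],  # different extraction windows
    "end_date": ["2021-12-31", "2021-12-31"],    # identical end dates
})
# keying on end_date would drop the second row as a duplicate;
# keying on start_date keeps both windows
deduped = df.drop_duplicates(subset=["sample_id", "lat", "lon", "start_date"])
print(deduped)
```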
Why does this change from end_date to start_date happen?
…inferred observations frequency
…es not have gaps; also fixed the indexing so that timestamp_ind starts with 0
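The commit above mentions an inferred observations frequency; a hedged sketch of one way such a frequency could be inferred from the timestamps themselves (the median-difference heuristic is an assumption, not the code from the PR):

```python
import pandas as pd

def infer_step_days(timestamps: pd.Series) -> int:
    """Median spacing between consecutive observations, in days."""
    ts = pd.to_datetime(timestamps).sort_values()
    return int(ts.diff().dropna().median().days)

# roughly 10 days would indicate dekadal data, roughly 30 days monthly data
```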
["sample_id", "timestamp", "lat", "lon"]
Something to keep in mind: