Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor process parquet #124

Open
wants to merge 14 commits into
base: croptype
Choose a base branch
from
Open

Refactor process parquet #124

wants to merge 14 commits into from

Conversation

cbutsko
Copy link

@cbutsko cbutsko commented Dec 11, 2024

  • now process_parquet can also work with dekadal inputs
  • introduced the minimum list of required columns - ["sample_id", "timestamp", "lat", "lon"]
  • other index columns used for pivoting are formed dynamically depending on what is available in the dataframe
  • checks done for valid_time variable are now optional (e.g., checking if valid_time is outside of available observations range or too close to the edge)

Something to keep in mind:

  • adding frequency of observations as explicit argument to the function
  • then we can avoid inferring frequency 5d93c39
  • and also re-introduce sanity checks of having all timestamps between first and last observation present, leaving no big gaps between consequent observations b4964b0

@cbutsko cbutsko requested a review from kvantricht December 11, 2024 13:48
@cbutsko cbutsko changed the base branch from main to croptype December 11, 2024 13:48
@cbutsko cbutsko mentioned this pull request Dec 12, 2024
3 tasks
Copy link
Contributor

@kvantricht kvantricht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, after the desk discussion yesterday.

presto/utils.py Outdated Show resolved Hide resolved
presto/utils.py Outdated
takes into account updated start_date and end_date; available_timesteps
holds the absolute number of timesteps that for which observations are
- computing the number of available timesteps in the timeseries;
it represents the absolute number of timesteps for which observations are
available; it cannot be less than NUM_TIMESTEPS; if this is the case,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this now more flexible for number of timesteps if the latter is still imported from presto-worldcereal. Doesn't seem to be a flexible parameter?

presto/utils.py Outdated Show resolved Hide resolved
presto/utils.py Outdated Show resolved Hide resolved
presto/utils.py Outdated Show resolved Hide resolved
presto/utils.py Outdated Show resolved Hide resolved
presto/utils.py Outdated
)
index_columns.append("available_timesteps")

# check for missing timestamps in the middle of timeseries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment belong to the actual code on the next line (being the concatenation of dataframes)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part is now removed b4964b0

@@ -314,6 +301,7 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
df = pd.concat([df, dummy_df])

# finally pivot the dataframe
index_columns = list(np.unique(index_columns))
df_pivot = df.pivot(index=index_columns, columns="timestamp_ind", values=feature_columns)
df_pivot = df_pivot.fillna(NODATAVALUE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the actual filling of timestamps "in the middle" ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part is now actually obselete, because ranking function that is used to create timestamp_ind does not make any gaps that might require filling in. I removed it for now b4964b0

presto/utils.py Outdated Show resolved Hide resolved
@@ -716,7 +696,7 @@ def prep_dataframe(
# SAR cannot equal 0.0 since we take the log of it
cols = [f"SAR-{s}-ts{t}-20m" for s in ["VV", "VH"] for t in range(36 if dekadal else 12)]

df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "end_date"])
df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "start_date"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this happen?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants