Refactor process parquet #124

cbutsko · 2024-12-11T13:48:08Z

now process_parquet can also work with dekadal inputs
introduced the minimum list of required columns - ["sample_id", "timestamp", "lat", "lon"]
other index columns used for pivoting are formed dynamically depending on what is available in the dataframe
checks done for valid_time variable are now optional (e.g., checking if valid_time is outside of available observations range or too close to the edge)
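As a rough illustration of the first three points, the sketch below checks the minimum required columns and then derives the pivot index from whatever non-feature columns are present. The helper name `derive_index_columns` and the `feature_columns` argument are hypothetical and not taken from `presto/utils.py`; only the required-column list comes from the description above.

```python
import pandas as pd

# Taken from the PR description; everything else in this sketch is illustrative.
REQUIRED_COLUMNS = ["sample_id", "timestamp", "lat", "lon"]


def derive_index_columns(df: pd.DataFrame, feature_columns: list) -> list:
    """Hypothetical helper: validate required columns, then build the index list."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Every column that is neither a feature nor the timestamp is treated as
    # an index column, so optional metadata is picked up when it is available.
    return [c for c in df.columns if c not in feature_columns and c != "timestamp"]
```

With this approach an input that carries extra metadata columns keeps them through the pivot, while an input missing `lat` or `lon` fails early with a clear message.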

Something to keep in mind:

adding frequency of observations as explicit argument to the function
then we can avoid inferring frequency 5d93c39
and also re-introduce sanity checks that all timestamps between the first and last observation are present, leaving no big gaps between consecutive observations b4964b0
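A minimal sketch of what such a sanity check could look like once the frequency is passed explicitly. The function name `find_missing_timestamps` and the `freq` argument are assumptions for illustration; the example uses the pandas alias `"MS"` (month start), and dekadal inputs would need a custom expected range because dekads are not a regular pandas frequency.

```python
import pandas as pd


def find_missing_timestamps(timestamps: pd.Series, freq: str) -> pd.DatetimeIndex:
    """Hypothetical check: expected timestamps between first and last that are absent."""
    observed = pd.DatetimeIndex(pd.to_datetime(timestamps)).unique()
    # Build the full expected range from the explicit frequency
    # instead of inferring the frequency from the data.
    expected = pd.date_range(observed.min(), observed.max(), freq=freq)
    return expected.difference(observed)


timestamps = pd.Series(["2021-01-01", "2021-02-01", "2021-04-01"])
missing = find_missing_timestamps(timestamps, freq="MS")
```

Here `missing` contains the single month that is absent from the series, which the caller can either reject or fill.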

kvantricht

Added some comments, after the desk discussion yesterday.

kvantricht · 2024-12-12T09:33:04Z

presto/utils.py

-      takes into account updated start_date and end_date; available_timesteps
-      holds the absolute number of timesteps that for which observations are
+    - computing the number of available timesteps in the timeseries;
+      it represents the absolute number of timesteps for which observations are
      available; it cannot be less than NUM_TIMESTEPS; if this is the case,


How is this now more flexible for the number of timesteps if the latter is still imported from presto-worldcereal? It doesn't seem to be a flexible parameter.

kvantricht · 2024-12-13T07:29:54Z

presto/utils.py

    )
+    index_columns.append("available_timesteps")

    # check for missing timestamps in the middle of timeseries


Does this comment belong to the actual code on the next line (being the concatenation of dataframes)?

this part is now removed b4964b0

kvantricht · 2024-12-13T07:30:10Z

presto/utils.py

@@ -314,6 +301,7 @@ def process_parquet(df: pd.DataFrame) -> pd.DataFrame:
        df = pd.concat([df, dummy_df])

    # finally pivot the dataframe
+    index_columns = list(np.unique(index_columns))
    df_pivot = df.pivot(index=index_columns, columns="timestamp_ind", values=feature_columns)
    df_pivot = df_pivot.fillna(NODATAVALUE)


Is this the actual filling of timestamps "in the middle"?

this part is now actually obsolete, because the ranking function that is used to create timestamp_ind does not leave any gaps that might require filling in. I removed it for now b4964b0
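To make the point about ranking concrete, here is a small sketch under the assumption that timestamp_ind is built with a dense rank per sample; the actual call in `presto/utils.py` is not shown in this thread. The `NODATAVALUE` value and the `B2` feature column are placeholders.

```python
import pandas as pd

NODATAVALUE = 65535  # placeholder value, assumed for this sketch only

df = pd.DataFrame(
    {
        "sample_id": ["a", "a", "a", "b", "b"],
        "timestamp": pd.to_datetime(
            ["2021-01-01", "2021-03-01", "2021-04-01", "2021-02-01", "2021-03-01"]
        ),
        "B2": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# A dense rank yields consecutive integers per sample; subtracting 1 makes
# the index start at 0. Calendar gaps (sample "a" has no February) do not
# produce gaps in timestamp_ind.
df["timestamp_ind"] = (
    df.groupby("sample_id")["timestamp"].rank(method="dense").astype(int) - 1
)

# Pivot to one row per sample; shorter series are padded with NODATAVALUE.
df_pivot = df.pivot(index=["sample_id"], columns="timestamp_ind", values=["B2"])
df_pivot = df_pivot.fillna(NODATAVALUE)
```

Because the rank is consecutive by construction, a missing calendar month is invisible at this stage, which is why a separate check on the raw timestamps would be needed to detect real gaps.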

kvantricht · 2024-12-13T07:33:26Z

presto/utils.py

@@ -716,7 +696,7 @@ def prep_dataframe(
    # SAR cannot equal 0.0 since we take the log of it
    cols = [f"SAR-{s}-ts{t}-20m" for s in ["VV", "VH"] for t in range(36 if dekadal else 12)]

-    df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "end_date"])
+    df = df.drop_duplicates(subset=["sample_id", "lat", "lon", "start_date"])


why does this happen?

Butsko Christina added 4 commits December 11, 2024 09:14

avoid pandas to_datetime for 12-timesteps case

97ebb7d

more generic handling of ts indexing

3293e75

refactoring process_parquet

2d2aa2d

using all non-feature columns as index; formatting

b6cd500

cbutsko requested a review from kvantricht December 11, 2024 13:48

cbutsko changed the base branch from main to croptype December 11, 2024 13:48

Butsko Christina added 2 commits December 11, 2024 14:54

removing lat-lon from required columns

aac145e

formatting

d88124a

cbutsko mentioned this pull request Dec 12, 2024

Refactor process_parquet #122

Open

3 tasks

kvantricht requested changes Dec 13, 2024

Butsko Christina added 8 commits December 13, 2024 10:44

better wording in docustring

ad5f373

made num_timesteps and min_edge_buffer arguments that can be passed to function

81b5092

added latlon and DEM columns to required columns

0aaa9b6

neater handling of valid_date/time dichotomy

5d85e5a

replaced month-based dates imputing with a more generic one based on inferred observations frequency

5d93c39

removed irrelevant check of missing indices since ranking function does not have gaps; also fixed the indexing so that timestamp_ind starts with 0

b4964b0

avoiding numpy overhead

0b9cf8c

putting filtering under the if

74c9ae6
