Right now we store each tracker file as a `.parquet` file at the end of scripts 1 and 2. That means two unnecessary writes and two unnecessary reads per file across the pipeline, and hence a big performance hit from all the extra I/O.
As the pipeline is quite stable by now, we should remove this entirely and keep the data in memory between the pipeline stages. This requires some rewriting of the existing scripts, because each script currently assumes it reads its input from the output files of the previous script. We should change this to passing the data directly to the next script, so instead of
- all data is processed by the first script
- all data is processed by the second script

we need a proper pipeline that

- moves each file through the first script and then on to the second script
so instead of processing all files twice, we have only a single loop like
```python
for tracker_file in tracker_files:
    raw_data = script_1(tracker_file)
    cleaned_data = script_2(raw_data)
```
The third script is not big and needs all the data available at once, so it will stay unchanged.
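A minimal sketch of what the combined loop could look like, assuming the scripts are refactored into Python functions that take and return pandas DataFrames (the exact function names and signatures below are illustrative, not the actual ones):

```python
import pandas as pd

def run_pipeline(tracker_files):
    cleaned_frames = []
    for tracker_file in tracker_files:
        raw_data = script_1(tracker_file)    # returns a DataFrame instead of writing a parquet file
        cleaned_data = script_2(raw_data)    # receives the DataFrame directly, no read from disk
        cleaned_frames.append(cleaned_data)

    # script 3 still needs all data at once, so collect the per-file results first
    all_data = pd.concat(cleaned_frames, ignore_index=True)
    return script_3(all_data)
```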
We should introduce an optional parameter like `export_data=F` that can be set to `TRUE` to enable writing the parquet files to disk when needed.
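One possible shape for that flag, again only a sketch (written in Python here, so the default is spelled `False`; `load_and_parse` and the output path are placeholders for whatever script 1 actually does):

```python
from pathlib import Path

def script_1(tracker_file, export_data=False, export_dir="output"):
    raw_data = load_and_parse(tracker_file)  # placeholder for the existing script 1 logic
    if export_data:
        # only write the intermediate parquet file when explicitly requested
        out_path = Path(export_dir) / f"{Path(tracker_file).stem}_raw.parquet"
        raw_data.to_parquet(out_path)
    return raw_data
```

Script 2 could take the same flag for its own intermediate output, so the old on-disk files stay available for debugging without being written on every run.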