avoid explicit storage of intermediate files #122

pmayd · 2024-01-02T20:41:43Z

Right now we store each tracker file as a .parquet file at the end of scripts 1 and 2. That means 2 unnecessary writes and 2 unnecessary reads per tracker file over the whole pipeline, which costs a lot of performance because of the extra I/O.

As the pipeline is quite stable by now, we should remove this entirely and keep the data in memory between the pipeline stages. This requires some rewriting of the existing scripts, because each script currently assumes that it reads its input from the output files of the previous script. We should change this so that the data is passed directly to the next script, so instead of

all data is processed by first script
all data is processed by second script

we need a proper pipeline that

moves each file through the first script and then to the second script

so instead of processing all files twice, we have only a single loop like

for tracker_file in tracker_files:
    raw_data = script_1(tracker_file)
    cleaned_data = script_2(raw_data)

The third script is not big and needs all data available so this script will be unchanged.

We should introduce an optional parameter like export_data=F (the default) that can be set to TRUE to write the parquet files to disk when needed.
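
To make the proposal concrete, here is a minimal sketch of the single-loop pipeline with the optional export switch. It is written in Python only because the loop above uses that syntax; the real scripts may well be in another language. Everything in it is an assumption for illustration: the bodies of script_1, script_2 and script_3 are placeholders, run_pipeline, export_dir and writer are hypothetical names, and handing the collected data to the third script in memory is just one possible way to feed it.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional

Rows = list[dict]


# Placeholder stand-ins for the real scripts (hypothetical logic).
def script_1(tracker_file: Path) -> Rows:
    """Extract raw rows from one tracker file."""
    return [{"source": tracker_file.name, "value": " 42 "}]


def script_2(raw_data: Rows) -> Rows:
    """Clean the raw rows produced by script_1."""
    return [{**row, "value": row["value"].strip()} for row in raw_data]


def script_3(all_cleaned: list[Rows]) -> int:
    """Needs all data at once; here it only counts rows."""
    return sum(len(rows) for rows in all_cleaned)


def run_pipeline(
    tracker_files: Iterable[Path],
    export_data: bool = False,
    export_dir: Optional[Path] = None,
    writer: Optional[Callable[[Rows, Path], None]] = None,
) -> int:
    """Move each file through script 1 and 2 in memory, then run script 3."""
    if export_data and (export_dir is None or writer is None):
        raise ValueError("export_data=True needs export_dir and writer")

    all_cleaned: list[Rows] = []
    for tracker_file in tracker_files:
        raw_data = script_1(tracker_file)
        cleaned_data = script_2(raw_data)

        # Intermediate files are written only on request.
        if export_data:
            stem = tracker_file.stem
            writer(raw_data, export_dir / f"{stem}_raw.parquet")
            writer(cleaned_data, export_dir / f"{stem}_cleaned.parquet")

        all_cleaned.append(cleaned_data)

    return script_3(all_cleaned)
```

The writer is passed in as a callable so the sketch does not depend on a particular parquet library; in practice it would wrap whatever the scripts already use to write parquet. With the default export_data=False the loop performs no disk writes at all.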
