avoid explicit storage of intermediate files #122

Open
pmayd opened this issue Jan 2, 2024 · 0 comments

pmayd commented Jan 2, 2024

Right now we store each tracker file as a .parquet file at the end of scripts 1 and 2. For the whole pipeline this means two unnecessary writes and two unnecessary reads per tracker file, which costs a lot of performance through avoidable I/O.

As the pipeline is quite stable by now, we should remove these intermediate files entirely and keep the data in memory between the pipeline stages. This requires some rewriting of the existing scripts, because each script currently assumes that it reads its input from the output files of the previous script. We should change this to passing the data directly to the next script, so instead of

  • all data is processed by first script
  • all data is processed by second script

we need a proper pipeline that

  • moves each file through the first script and then to the second script

so instead of processing all files twice, we have only a single loop like

for tracker_file in tracker_files:
    raw_data = script_1(tracker_file)
    cleaned_data = script_2(raw_data)

The third script is small and needs all data available at once, so it will remain unchanged.
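A minimal sketch of how the single loop and the unchanged third script could fit together, written in Python to match the loop above. The function bodies are placeholders and `script_3` and `run_pipeline` are hypothetical names, not the project's actual API; the real scripts would pass data frames rather than dicts.

```python
from pathlib import Path
from typing import Iterable


def script_1(tracker_file: Path) -> dict:
    """Stand-in for script 1: extract raw data from one tracker file."""
    return {"source": tracker_file.name, "rows": []}


def script_2(raw_data: dict) -> dict:
    """Stand-in for script 2: clean the raw data returned by script_1."""
    return {**raw_data, "cleaned": True}


def script_3(all_cleaned: list[dict]) -> int:
    """Stand-in for script 3: needs every cleaned tracker at once.

    Here it only counts them; the real script would combine the data.
    """
    return len(all_cleaned)


def run_pipeline(tracker_files: Iterable[Path]) -> int:
    """Move each file through script 1 and 2 in memory, then run script 3."""
    cleaned = []
    for tracker_file in tracker_files:
        raw_data = script_1(tracker_file)    # no intermediate write
        cleaned.append(script_2(raw_data))   # no intermediate read
    return script_3(cleaned)
```

Collecting the cleaned results in a list keeps everything in memory until the third script runs, which is only reasonable as long as all trackers fit in memory together.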

We should introduce an optional parameter like export_data=FALSE that can be set to TRUE to enable writing parquet files to disk, if needed.
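One way the optional export could look, again sketched in Python, so the flag uses `False`/`True` here. `export_if_requested` and `out_path` are hypothetical names, and the sketch assumes the data object offers a pandas-style `to_parquet(path)` method.

```python
from pathlib import Path


def export_if_requested(data, out_path, export_data=False):
    """Write `data` as a parquet file only when export_data is True.

    Assumption: `data` has a pandas-style `to_parquet(path)` method.
    Returns True if a file was written, False otherwise.
    """
    if not export_data:
        return False
    out_path = Path(out_path)
    # Create the target directory on demand so the default run touches no disk.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    data.to_parquet(out_path)
    return True
```

Calling this after each stage keeps the default run free of I/O while still allowing intermediate files to be written for debugging.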
