Implemented Full + Incremental export + purge + restore
Exporting and purging entries in 1-hour time chunks into separate files, each with a defined start_ts and end_ts.

start_ts is the last_processed_ts of the user_id's purge pipeline state:
- If the pipeline has already been run for the user, this is a non-None value.
- If the pipeline hasn't been run, it will be None; in this case, the earliest timestamp entry is chosen as the start_ts. This avoids ValueErrors when adding 1 hour (3600 seconds) to start_ts for the incremental export.

end_ts differs between full export and incremental export:
- Full export: the current time at which the pipeline is run is the end_ts; the value returned by pipeline_queries when initiating the pipeline stage is used.
- Incremental export: the first end_ts is 1 hour ahead of start_ts; this value keeps being incremented by 1 hour as long as data exists for the user. If adding 1 hour would exceed the current time, end_ts is capped at the current time.

The export + delete process continues as long as there is raw timeseries data for the user.

-------

But what does 1 hour's worth of data mean?
- In any case, the purge pipeline runs up to the current time or until no more raw timeseries entries are present in the DB for the user.
- If the purge pipeline is running for the first time for a user, it will export and purge all of the user's timeseries data from its first entry (which can be really old data, so the first purge run might take a long time).
- If the purge pipeline has already been run before for a user, it sets start_ts to last_processed_ts and exports data from that point onward.
- If the purge pipeline is run hourly, each run would eventually handle only a small subset of entries.

-------

Some points to consider:

A. Numerous files; less data per file
One observation is that the current logic creates multiple files in 1-hour chunks, which is okay, but these files don't really have a lot of entries.
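The start_ts / end_ts chunking described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline_queries API: the arguments last_processed_ts, earliest_raw_ts, and now stand in for the values the real pipeline state and stage-initiation code would supply.

```python
ONE_HOUR = 3600  # incremental chunk size, in seconds

def compute_chunks(last_processed_ts, earliest_raw_ts, now, incremental=True):
    """Yield (start_ts, end_ts) pairs for the export + purge loop.

    last_processed_ts: purge pipeline state for this user, or None on first run.
    earliest_raw_ts:   ts of the user's oldest raw timeseries entry.
    now:               current time returned when the pipeline stage starts.
    """
    # First run: fall back to the earliest entry so start_ts is never None
    # (avoids a ValueError when adding 3600 seconds to it).
    start_ts = last_processed_ts if last_processed_ts is not None else earliest_raw_ts

    if not incremental:
        # Full export: a single chunk up to the current time.
        yield (start_ts, now)
        return

    # Incremental export: advance 1 hour at a time, capping at the current time.
    while start_ts < now:
        end_ts = min(start_ts + ONE_HOUR, now)
        yield (start_ts, end_ts)
        start_ts = end_ts
```

For example, a first-ever run (last_processed_ts=None) whose earliest entry is at ts=1000 and whose current time is ts=10000 yields (1000, 4600), (4600, 8200), (8200, 10000), with the last chunk capped at the current time.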
What could be more efficient is to keep storing entries in the same file until a threshold is reached, say 5000 or 10000 entries (like batch_size in load_multi_timeline_for_range). As long as the threshold batch size isn't reached, keep adding to the same file, updating end_ts while start_ts remains the same. Will attempt this as the next step.

------

B. Right approach for export + purge?

Approach A:
1. Export data in chunks to a file.
2. Delete exported data from the DB.
3. Repeat until all data is purged.
Flow: Export -> Delete -> Export -> Delete -> Export -> Delete

——

Approach B:
1. Export data in chunks to a file.
2. Repeat until all data is exported.
3. Delete all exported data from the DB.
Flow: Export -> Export -> Export ... -> Delete

---------------

C. Do we need all 3 types of entries: locations, trips, places?

For now, commented out code in export_timeseries.py. If we only need location entries, the code can be simplified further to work for just these entries.

If these are roughly subsets of each other (location -> trip -> place), then I can safely take just location. But this is valid only if all entries contain location (and hence ts) values. If only trip entries or only place entries are present, directly choosing the latest ts is incorrect, since trips use enter_ts while places use start_ts.

Searched the codebase for references to start_ts and enter_ts and read through Shankari's thesis. I'm getting hints that start_ts and enter_ts are analysis_timeseries entries? In that case, these can be ignored, since purge + restore is concerned with raw timeseries data only.

Trip entries are created in emission/analysis/intake/segmentation/trip_segmentation.py

——

Hint 1: Timeseries_Sample.ipynb
- ct_df fetches analysis/cleaned_trip entries -> analysis DB

------

Hint 2: bin/historical/migrations/populate_local_dt.py
- Looks like old code; some changes were last made 8 years ago.
- The collection parameter refers to some non-timeseries databases, as seen from the function calls.
- The entry['start_ts'] or entry['enter_ts'] values are then used in the find query by setting data.ts to this value.

---------

D. Is pipeline_states export needed?
Remove the pipeline_states export if it is not needed. It is currently used in the existing export + load scripts.

---------
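The threshold idea from point A above could be sketched like this. This is a rough illustration under stated assumptions: batch_size, the in-memory list of entries, and the 'ts' key are placeholders, not the existing export code, which streams chunks from the DB rather than grouping a pre-fetched list.

```python
def export_in_batches(entries, batch_size=10000):
    """Group already-fetched entries into files of up to batch_size entries.

    Instead of writing one file per 1-hour chunk, keep extending the current
    file: start_ts stays fixed while end_ts advances with each appended entry,
    and a new file is started only once the threshold is reached.
    """
    batches = []           # each element stands in for one export file
    current = []
    for entry in entries:  # entries assumed sorted by their 'ts' value
        current.append(entry)
        if len(current) >= batch_size:
            batches.append(current)
            current = []
    if current:            # flush the final, possibly smaller, batch
        batches.append(current)
    # Each file's (start_ts, end_ts) spans its first and last entry.
    spans = [(b[0]["ts"], b[-1]["ts"]) for b in batches]
    return batches, spans
```

With 25 entries and batch_size=10, this produces files of 10, 10, and 5 entries, so sparse hours no longer each get their own near-empty file.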
Mahadik, Mukul Chandrakant authored and committed on Aug 29, 2024
1 parent 661a222 · commit ec162ad
6 changed files with 138 additions and 73 deletions