Streamify import and process osm data task #1214
Conversation
Add new classes that inherit from DataGeojson and DataOsmRaw. Instead of reading the whole file at once, these classes stream it piece by piece asynchronously, allowing large files to be read without crashing the application.
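A minimal sketch of the streaming read described above (assuming the JSONStream package and GeoJSON typings already used in the code discussed below; streamGeojsonFeatures is a hypothetical helper name, not the actual DataStreamGeojson method):

import * as fs from 'fs';
import * as JSONStream from 'JSONStream';

// Hypothetical helper illustrating the streaming read: the file is piped
// through a JSONStream parser that emits one feature at a time, so the
// whole file never has to live in a single string.
function streamGeojsonFeatures(filename: string): Promise<GeoJSON.Feature[]> {
    return new Promise((resolve, reject) => {
        const features: GeoJSON.Feature[] = [];
        const readStream = fs.createReadStream(filename);
        const jsonParser = JSONStream.parse('features.*');
        jsonParser.on('data', (feature: GeoJSON.Feature) => features.push(feature));
        jsonParser.on('end', () => resolve(features));
        jsonParser.on('error', reject);
        readStream.on('error', reject);
        readStream.pipe(jsonParser);
    });
}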
Let me know if I am mistaken in my review, but I'm not sure the intended goal of streaming the operation is actually achieved.
packages/chaire-lib-common/src/tasks/dataImport/data/dataGeojson.ts
console.log('Start streaming GeoJSON data.');
const readStream = fs.createReadStream(this._filename);
const jsonParser = JSONStream.parse('features.*');
const features: GeoJSON.Feature[] = [];
Here, you seem to fill the features array with the file content. If the file content is too big for memory, won't the features array be too big as well? Ideally, each feature should be "processed" (whatever processing means to the consumer of this class) and dropped after processing to avoid filling the memory.
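For illustration, a minimal sketch of the per-feature processing suggested here, under the same assumptions as the sketch above (processGeojsonStream and processFeature are hypothetical names):

import * as fs from 'fs';
import * as JSONStream from 'JSONStream';

// Hypothetical variant: each feature is handled as soon as it is parsed and
// then released, so memory stays bounded instead of growing with the file.
function processGeojsonStream(
    filename: string,
    processFeature: (feature: GeoJSON.Feature) => void
): Promise<void> {
    return new Promise((resolve, reject) => {
        const jsonParser = JSONStream.parse('features.*');
        jsonParser.on('data', (feature: GeoJSON.Feature) => processFeature(feature));
        jsonParser.on('end', () => resolve());
        jsonParser.on('error', reject);
        fs.createReadStream(filename).on('error', reject).pipe(jsonParser);
    });
}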
No, because the main original limitation was that the whole file was put into one big string, which had a maximum size.
Yes, this approach still takes a lot of memory, but Node seems able to cope since the file is read in small chunks. We will need to refactor this further, but I don't think we have time for that at the moment.
OK, so it is less limited than it used to be, and an intermediary step for later, when everything is in PostGIS? Fair enough. Then just fix the lowercase 'c' in the function name and it's good.
We'll wait for confirmation from @GabrielBruno24 that it actually works on a region as big as Montreal.
It would be interesting to see if there's a limit.
(And maybe some stats on memory usage while it runs.)
It takes forever because the running time is O(N^2), but it does work if you give Node enough memory, yes. It would work better if all the write actions were part of a stream, like I did with tasks 1 and 1b, but the logic here is a lot more complex, so it would be better to just rewrite the function from scratch.
what is enough memory? (ballpark)
About 8 GB
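For reference, Node's heap limit can be raised with the --max-old-space-size flag (value in MB); the script path below is a placeholder, not the actual task entry point:

node --max-old-space-size=8192 path/to/importAndProcessOsmData.task.js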
a few small things on my side
packages/chaire-lib-common/src/tasks/dataImport/data/dataGeojson.ts
packages/chaire-lib-common/src/tasks/dataImport/data/osmGeojsonService.ts
Modifies the function getGeojsonsFromRawData() so that it accepts a new option parameter, continueOnMissingGeojson. When generateNodesIfNotFound is false and continueOnMissingGeojson is true, the function will simply skip to the next geojson and print a warning if a feature that is in the raw OSM data is not in the geojson data. Previously, the only option was to throw an error and interrupt the process.
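A hypothetical, simplified sketch of the behaviour described above (the real getGeojsonsFromRawData() in osmGeojsonService.ts has a different signature and also handles node generation, which is omitted here):

// Simplified illustration only: the node-generation path used when
// generateNodesIfNotFound is true is left out.
function getGeojsonsFromRawDataSketch(
    rawFeatures: { id: string }[],
    geojsonById: Map<string, GeoJSON.Feature>,
    options: { generateNodesIfNotFound: boolean; continueOnMissingGeojson: boolean }
): GeoJSON.Feature[] {
    const geojsons: GeoJSON.Feature[] = [];
    for (const raw of rawFeatures) {
        const geojson = geojsonById.get(raw.id);
        if (geojson === undefined) {
            if (!options.generateNodesIfNotFound && options.continueOnMissingGeojson) {
                // New behaviour: warn and skip instead of aborting the whole import
                console.warn(`No geojson found for raw OSM feature ${raw.id}, skipping.`);
                continue;
            }
            // Previous behaviour: any missing geojson interrupted the process
            throw new Error(`No geojson found for raw OSM feature ${raw.id}`);
        }
        geojsons.push(geojson);
    }
    return geojsons;
}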
Force-pushed from e4072fd to 218486a
looks good to me
// The proper way to do this would be to do it in getData(), but making that method async would force us to modify every class this inherits from, so we use this factory workaround.
// TODO: Rewrite the class from scratch so that it accepts an async getData().
static async create(filename: string): Promise<DataStreamGeojson> {
    const instance = new DataStreamGeojson(filename);
Ideally, with the factory approach, the data could have been read before calling the constructor, so the object would always have been initialized. But since we're going to rewrite all of this from scratch soon, it's fine as is. This comment is more for any other factory you might have to write some day in the future.
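A sketch of the alternative factory described in this comment, reusing the hypothetical streamGeojsonFeatures helper from the earlier sketch (names are illustrative, not the actual DataStreamGeojson API):

// The data is read before the constructor runs, so no instance can ever be
// observed in a partially initialized state.
class PreloadedGeojsonData {
    private constructor(private readonly features: GeoJSON.Feature[]) {}

    static async create(filename: string): Promise<PreloadedGeojsonData> {
        const features = await streamGeojsonFeatures(filename);
        return new PreloadedGeojsonData(features);
    }

    getData(): GeoJSON.Feature[] {
        return this.features;
    }
}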
Changes the import and process osm data task so that it reads the source files with an asynchronous stream. Also adds an option in one of the functions used by the task so that it will not halt if a single geojson is missing compared to the raw OSM data.