Store fewer objects in memory during import #7
orangejulius added a commit that referenced this issue on Aug 3, 2016:
Previously, the WOF importer loaded all records into memory in one stream, and then processed and indexed the records in Elasticsearch in a second stream after the first stream was done. This has several problems:

* It requires that all data can fit into memory. While this is not _so_ bad for WOF admin data, where a reasonably new machine can handle things just fine, it's horrible for venue data, where there are already tens of millions of records.
* It's slower: because the disk and network I/O sections are separated, they can't be interleaved to speed things up.
* It doesn't give good feedback that something is happening: the importer sits for several minutes loading records before the dbclient progress logs start displaying.

This change fixes all of those issues by processing all records in a single stream, starting at the highest hierarchy level and finishing at the lowest, so that all records always have the admin data they need to be processed.

Fixes #101
Fixes #7
Connects #94
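The ordering idea in the commit message above can be sketched as follows. This is an illustrative sketch, not the importer's actual code: the placetype names, record shapes, and function names are assumptions. The point is that sorting records from the coarsest hierarchy level down lets a single pass handle every record after its admin parents.

```javascript
// Sketch: process records in one pass, highest hierarchy level first,
// so each record's admin parents have already been handled.
// Placetype list and record shapes are illustrative, not the real API.
const placetypeOrder = ['country', 'region', 'county', 'locality', 'venue'];

function processInHierarchyOrder(records, handle) {
  const rank = new Map(placetypeOrder.map((pt, i) => [pt, i]));
  records
    .slice() // avoid mutating the caller's array
    .sort((a, b) => rank.get(a.placetype) - rank.get(b.placetype))
    .forEach(handle);
}

// Example: a locality is handled only after its country and region.
const seen = [];
processInHierarchyOrder(
  [
    { id: 3, placetype: 'locality' },
    { id: 1, placetype: 'country' },
    { id: 2, placetype: 'region' },
  ],
  (r) => seen.push(r.id)
);
console.log(seen); // [ 1, 2, 3 ]
```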
Right now this importer stores an object in memory for every single WOF item. While that object doesn't include all the fields from the WOF item (polygon data, most notably, isn't stored), it still takes up quite a bit of memory.
As WOF grows, or as this importer loads more fields, it will almost certainly run up against the Node.js memory limit (which currently defaults to 2GB but is adjustable with a flag). It's probably close enough to that limit now that performance is suffering.
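For reference, the flag in question is V8's `--max-old-space-size`, which takes a value in megabytes. The entry-point filename below is a placeholder, not the importer's actual script name:

```shell
# Raise the V8 old-space heap limit to 4GB (value is in megabytes).
# 'import.js' is a placeholder for the importer's real entry point;
# the inline script just demonstrates that the flag is accepted.
node --max-old-space-size=4096 -e "console.log('heap limit raised')"
```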
Thanks to #119, we can now import all venues, because only the hierarchy records have to stay in memory. For now there isn't enough data there to cause problems, but someday there might be.
When that happens, we'll probably want to process items individually and load all the required parent items on demand, without storing them, or perhaps storing only a certain number.
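The "store only a certain number" idea could look something like the bounded cache below. This is a minimal sketch under stated assumptions: `loadRecord` stands in for whatever actually reads a parent record from disk, and the cache size and eviction policy are illustrative:

```javascript
// Sketch: load parent records on demand, keeping at most `maxSize`
// of them in memory. Uses Map's insertion order for simple LRU eviction.
class BoundedCache {
  constructor(maxSize, loader) {
    this.maxSize = maxSize;
    this.loader = loader; // e.g. a loadRecord(id) function (hypothetical)
    this.map = new Map();
  }

  get(id) {
    if (this.map.has(id)) {
      // Cache hit: re-insert so this entry becomes the most recent.
      const value = this.map.get(id);
      this.map.delete(id);
      this.map.set(id, value);
      return value;
    }
    const value = this.loader(id); // load the parent record on demand
    this.map.set(id, value);
    if (this.map.size > this.maxSize) {
      // Evict the least recently used entry (first key in the Map)
      // instead of keeping every parent record in memory.
      this.map.delete(this.map.keys().next().value);
    }
    return value;
  }
}

// Usage: at most 2 parent records are ever held in memory.
let loads = 0;
const cache = new BoundedCache(2, (id) => { loads++; return { id }; });
cache.get(1); cache.get(2); cache.get(1); // two loads; 1 refreshed
cache.get(3);                             // third load; evicts 2
cache.get(1);                             // still cached, no load
console.log(loads); // 3
```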