Store fewer objects in memory during import #7
orangejulius added a commit that referenced this issue on Aug 3, 2016:
Previously, the WOF importer loaded all records into memory in one stream, and then processed and indexed the records in Elasticsearch in a second stream after the first stream was done. This has several problems:

* It requires that all data can fit into memory. While this is not _so_ bad for WOF admin data, where a reasonably new machine can handle things just fine, it's horrible for venue data, where there are already tens of millions of records.
* It's slower: because the disk and network I/O sections are separated, they can't be interleaved to speed things up.
* It doesn't give good feedback that something is happening: the importer sits for several minutes loading records before the dbclient progress logs start displaying.

This change fixes all of those issues by processing all records in a single stream, starting at the highest hierarchy level and finishing at the lowest, so that all records always have the admin data they need to be processed.

Fixes #101
Fixes #7
Connects #94
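The ordering idea in the commit message above can be sketched as follows. This is an illustrative sketch, not the importer's actual code: the placetype names, record shapes, and function names are assumptions. The point is that sorting records from the coarsest hierarchy level down lets a single pass handle every record after its admin parents.

```javascript
// Sketch: process records in one pass, highest hierarchy level first,
// so each record's admin parents have already been handled.
// Placetype list and record shapes are illustrative, not the real API.
const placetypeOrder = ['country', 'region', 'county', 'locality', 'venue'];

function processInHierarchyOrder(records, handle) {
  const rank = new Map(placetypeOrder.map((pt, i) => [pt, i]));
  records
    .slice() // avoid mutating the caller's array
    .sort((a, b) => rank.get(a.placetype) - rank.get(b.placetype))
    .forEach(handle);
}

// Example: a locality is handled only after its country and region.
const seen = [];
processInHierarchyOrder(
  [
    { id: 3, placetype: 'locality' },
    { id: 1, placetype: 'country' },
    { id: 2, placetype: 'region' },
  ],
  (r) => seen.push(r.id)
);
console.log(seen); // [ 1, 2, 3 ]
```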
Right now this importer stores an object in memory for every single WOF item. While that object doesn't include all the fields from the WOF item (polygon data, most notably, isn't stored), it still takes up quite a bit of memory.
As WOF grows, or as this importer loads more fields, it will almost certainly run up against the Node.js memory limit (which currently defaults to 2GB but is adjustable with a flag). It's probably close enough to that limit now that performance is suffering.
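For reference, the flag in question is V8's `--max-old-space-size`, which takes a value in megabytes. The entry-point filename below is a placeholder, not the importer's actual script name:

```shell
# Raise the V8 old-space heap limit to 4GB (value is in megabytes).
# 'import.js' is a placeholder for the importer's real entry point;
# the inline script just demonstrates that the flag is accepted.
node --max-old-space-size=4096 -e "console.log('heap limit raised')"
```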
Thanks to #119, we can now import all venues, because only the hierarchy records have to stay in memory. For now there isn't enough data there to cause problems, but someday there might be.
When that happens, we'll probably want to process items individually and load all the required parent items on demand, without storing them, or perhaps storing only a certain number.
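The "store only a certain number" idea could look something like the bounded cache below. This is a minimal sketch under stated assumptions: `loadRecord` stands in for whatever actually reads a parent record from disk, and the cache size and eviction policy are illustrative:

```javascript
// Sketch: load parent records on demand, keeping at most `maxSize`
// of them in memory. Uses Map's insertion order for simple LRU eviction.
class BoundedCache {
  constructor(maxSize, loader) {
    this.maxSize = maxSize;
    this.loader = loader; // e.g. a loadRecord(id) function (hypothetical)
    this.map = new Map();
  }

  get(id) {
    if (this.map.has(id)) {
      // Cache hit: re-insert so this entry becomes the most recent.
      const value = this.map.get(id);
      this.map.delete(id);
      this.map.set(id, value);
      return value;
    }
    const value = this.loader(id); // load the parent record on demand
    this.map.set(id, value);
    if (this.map.size > this.maxSize) {
      // Evict the least recently used entry (first key in the Map)
      // instead of keeping every parent record in memory.
      this.map.delete(this.map.keys().next().value);
    }
    return value;
  }
}

// Usage: at most 2 parent records are ever held in memory.
let loads = 0;
const cache = new BoundedCache(2, (id) => { loads++; return { id }; });
cache.get(1); cache.get(2); cache.get(1); // two loads; 1 refreshed
cache.get(3);                             // third load; evicts 2
cache.get(1);                             // still cached, no load
console.log(loads); // 3
```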