Impractical for Large Databases #98
Comments
It's been a long while since I looked at this script, but I'm pretty sure the curly brace removal is needed for the resulting files to be valid for import. The `wc` call, however, is just measuring the space reduction for reporting, so you can probably comment that out, but it's not going to be a quick operation either way. Sounds like you're well placed to time each of the steps and report which ones are hurting the process, but so far you've only pointed at one line that can be commented out (this isn't me complaining, just being specific about the info needed :) ). What's your total export time? What time would you expect instead?
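For reference, a rough way to time each stage would be something like the sketch below; the URL, file names, and the sed pattern are placeholders, not the script's actual steps:

```bash
#!/bin/bash
# Rough sketch for timing each stage so the slow ones stand out.
# Everything here (URL, file names, sed expression) is a placeholder,
# not the script's real pipeline.
timed() {
  local label=$1; shift
  local start=$SECONDS
  "$@"
  echo "$label took $(( SECONDS - start ))s" >&2
}

timed download     curl -sS "$COUCH_URL/$DB/_all_docs?include_docs=true" -o dump.json
timed strip-braces sed -i 's/^{"total_rows".*\[//' dump.json
timed line-count   wc -l dump.json
```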
Thanks for the quick response :) Yes, I am also not very happy that the npm replica has to be done this way :) (I went through 5 or 6 different tools today just to migrate this damn database to a new server; of course, regular replication doesn't work, for other reasons...) I cancelled the script after the initial curl request was done, because I saw what would happen in the later steps. I had to do some other operation on the JSON anyway, and the initial file was good enough for that. The curl `_all_docs` download took 2-3 hours, at about 20-30 MB/s.
Makes sense. I think ultimately any real improvement would come from moving it to a compiled language (e.g. Golang) for faster in-stream processing, regex speed, etc.

Edit: though I wonder about the server-side speed limitations for `_all_docs`, tbh, too :)

Edit 2: I'll make a side note somewhere to see if this is a project I can throw at an LLM to convert to Golang for kicks, when time allows lol
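Edit 3: even before a rewrite, a single-pass pipeline might help: instead of writing the raw dump to disk and then rewriting it in place several times, the clean-up could be chained onto the stream so the file is written exactly once. A rough sketch, where the sed expressions are placeholders for whatever the script actually strips:

```bash
# Hypothetical single-pass variant: curl streams straight through the
# clean-up filters and onto disk, so a 160 GB dump is written once
# instead of being rewritten per step via temp copies.
# The sed expressions are placeholders, not the script's real ones.
curl -sS "$COUCH_URL/$DB/_all_docs?include_docs=true" \
  | sed -e 's/^{"total_rows".*\[$//' -e 's/,$//' \
  > dump.json
```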
I tried to dump my CouchDB containing a replica of the npm registry: about 160 GB.
With the architecture of this script, it is borderline unusable. There are too many unnecessary space "optimizations" that lead to numerous temporary full copies of the file. On an HDD this takes far too much time. Example: the `wc -l` alone takes more than 10 minutes at 200 MB/s. Why would I waste this time to remove (in this case) an insignificant 3 MB from the file size?
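For the record, if the count is only there to report the space saved, a metadata lookup would avoid the full scan entirely. A sketch, assuming GNU coreutils; a line count genuinely has to read every byte, but a byte count does not:

```bash
# Sketch, assuming the count is only used for the "space saved" report.
# stat reads the size from the inode, so it returns instantly even on a
# 160 GB file, whereas wc has to stream the whole file.
before=$(stat -c %s dump.json)   # GNU stat; on macOS/BSD: stat -f %z
# ... clean-up step runs here ...
after=$(stat -c %s dump.json)
echo "saved $(( before - after )) bytes"
```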