Impractical for Large Databases #98

Open
Clashsoft opened this issue Dec 7, 2024 · 3 comments


@Clashsoft

I tried to dump my CouchDB containing a replica of the npm registry. About 160 GB.

With this script's architecture, that is borderline unusable: there are too many unnecessary space "optimizations", each of which produces a temporary full copy of the file. On an HDD this takes far too much time. For example:

    $echoVerbose && echo "... INFO: Stage 2 - Duplicate curly brace removal"
    # Approx 1Byte per line removed
    KBreduction=$((`wc -l ${file_name} | awk '{print$1}'` / 1024))

The wc -l alone takes more than 10 minutes at 200 MB/s. Why would I waste that time just to shave (in this case) an insignificant 3 MB off the file size?
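
For instance, the whole reporting step could be gated behind the existing verbose flag, so non-verbose runs skip the extra pass over the file entirely. A rough sketch, reusing the script's $echoVerbose, ${file_name} and KBreduction from the snippet above (the final echo line is purely illustrative):

    # Sketch only: pay for the full-file line count only when verbose reporting is on.
    if $echoVerbose; then
        echo "... INFO: Stage 2 - Duplicate curly brace removal"
        # Approx 1 byte per line removed
        KBreduction=$(( $(wc -l < "${file_name}") / 1024 ))
        echo "... INFO: Expecting approx ${KBreduction} KB reduction"
    fi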

@dalgibbard
Collaborator

It's been a long while since I last looked at this script, but I'm pretty sure the curly brace removal is needed for the resulting files to be valid for import.

However, the 'wc' call is only there to report the space reduction... you can probably just comment it out, but the overall process isn't going to be quick either way...
To be honest, I'm not convinced that CouchDB as an npm registry replica is a good use case :) but that's out of scope here.

It sounds like you'd be well placed to time each of the steps and report which ones are hurting the process most; so far you've only pointed at one line, which can be commented out (this isn't a complaint, just being specific about the info needed :) ).
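
Even rough numbers would help, e.g. something like this (just a sketch; the file name and the sed filter are illustrative, not the script's actual steps):

    # Sketch only: wrap the heavy passes in `time` to see where the hours go.
    time wc -l dump.json
    time sed 's/,$//' dump.json > dump.stripped.json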

What's your total export time? What time would you be expecting instead?

@Clashsoft
Author

Thanks for the quick response :)

Yes, I'm also not very happy that the npm replica is done this way :) (I went through 5 or 6 different tools today just to migrate this damn database to a new server. Of course, regular replication doesn't work, for other reasons...)

I cancelled the script after the initial curl request finished, because I could see what would happen in the later steps. I had to do some other processing on the JSON anyway, and the initial file was good enough for that.

The curl _all_docs download took 2-3 hours, at roughly 20-30 MB/s.
Ideally, all of the processing would happen during that download, by streaming and piping the different steps together (something like the sketch below).
I know that would be a rather big undertaking, especially since it's a shell script.
Just wanted to share my comment for anyone who stumbles upon the tool in the future :)
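
Something along these lines is what I had in mind (just a sketch; the URL and the sed filter are placeholders, not the script's actual transformations):

    # Sketch only: stream the _all_docs response straight through the cleanup
    # filters and into compression, so the dump only touches disk once.
    curl -sS "http://127.0.0.1:5984/registry/_all_docs?include_docs=true" \
      | sed -e 's/,$//' \
      | gzip > registry-dump.json.gz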

@dalgibbard
Collaborator

dalgibbard commented Dec 7, 2024

Makes sense. I think ultimately any real improvement would come from moving this to a compiled language (e.g. Golang) for better in-stream processing and regex speed.

Edit: though I also wonder about server-side speed limits on _all_docs, to be honest :)

Edit 2: I'll make a note somewhere to see whether this is a project I can throw at an LLM to convert to Golang for kicks, when time allows.
