Impractical for Large Databases #98

Open
Clashsoft opened this issue Dec 7, 2024 · 3 comments


@Clashsoft

I tried to dump my CouchDB containing a replica of the npm registry. About 160 GB.

With this script's architecture, that is borderline unusable: there are too many unnecessary space "optimizations", each of which produces a temporary full copy of the file. On an HDD this takes far too much time. For example:

    $echoVerbose && echo "... INFO: Stage 2 - Duplicate curly brace removal"
    # Approx 1Byte per line removed
    KBreduction=$((`wc -l ${file_name} | awk '{print$1}'` / 1024))

The wc -l alone takes more than 10 minutes at 200 MB/s. Why would I waste that time just to shave (in this case) an insignificant 3 MB off the file size?
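
For instance, the whole reporting step could be gated behind the existing verbose flag, so non-verbose runs skip the extra pass over the file entirely. A rough sketch, reusing the script's $echoVerbose, ${file_name} and KBreduction from the snippet above (the final echo line is purely illustrative):

    # Sketch only: pay for the full-file line count only when verbose reporting is on.
    if $echoVerbose; then
        echo "... INFO: Stage 2 - Duplicate curly brace removal"
        # Approx 1 byte per line removed
        KBreduction=$(( $(wc -l < "${file_name}") / 1024 ))
        echo "... INFO: Expecting approx ${KBreduction} KB reduction"
    fi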

@dalgibbard
Collaborator

It's been a long while since I last looked at this script, but I'm pretty sure the curly brace removal is needed for the resulting files to be valid for import.

However, the 'wc' call is only there to report the space reduction... you can probably just comment it out, but the overall process isn't going to be quick either way...
To be honest, I'm not convinced that CouchDB as an npm registry replica is a good use case :) but that's out of scope here.

It sounds like you'd be well placed to time each of the steps and report which ones are hurting the process most; so far you've only pointed at one line, which can be commented out (this isn't a complaint, just being specific about the info needed :) ).
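
Even rough numbers would help, e.g. something like this (just a sketch; the file name and the sed filter are illustrative, not the script's actual steps):

    # Sketch only: wrap the heavy passes in `time` to see where the hours go.
    time wc -l dump.json
    time sed 's/,$//' dump.json > dump.stripped.json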

What's your total export time? What time would you be expecting instead?

@Clashsoft
Author

Thanks for the quick response :)

Yes, I'm also not very happy that the npm replica is done this way :) (I went through 5 or 6 different tools today just to migrate this damn database to a new server. Of course, regular replication doesn't work, for other reasons...)

I cancelled the script after the initial curl request finished, because I could see what would happen in the later steps. I had to do some other processing on the JSON anyway, and the initial file was good enough for that.

The curl _all_docs download took 2-3 hours, at roughly 20-30 MB/s.
Ideally, all of the processing would happen during that download, by streaming and piping the different steps together (something like the sketch below).
I know that would be a rather big undertaking, especially since it's a shell script.
Just wanted to share my comment for anyone who stumbles upon the tool in the future :)
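
Something along these lines is what I had in mind (just a sketch; the URL and the sed filter are placeholders, not the script's actual transformations):

    # Sketch only: stream the _all_docs response straight through the cleanup
    # filters and into compression, so the dump only touches disk once.
    curl -sS "http://127.0.0.1:5984/registry/_all_docs?include_docs=true" \
      | sed -e 's/,$//' \
      | gzip > registry-dump.json.gz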

@dalgibbard
Collaborator

dalgibbard commented Dec 7, 2024

Makes sense. I think ultimately any real improvement would come from moving this to a compiled language (e.g. Golang) for better in-stream processing and regex speed.

Edit: though I also wonder about server-side speed limits on _all_docs, to be honest :)

Edit 2: I'll make a note somewhere to see whether this is a project I can throw at an LLM to convert to Golang for kicks, when time allows.
