
Stuck at 'Stage 1 - Document filtering' #68

Closed
peteruithoven opened this issue Aug 2, 2017 · 6 comments

@peteruithoven (Contributor)

I'm using couchdb-dump version: 1.1.7

I have a database, which is successfully downloaded to a file (39MB), but it gets stuck at Stage 1 - Document filtering.

... INFO: Output file bob.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38.2M    0 38.2M    0     0  10.0M      0 --:--:--  0:00:03 --:--:-- 10.0M
... INFO: File may contain Windows carridge returns- converting...
... INFO: Completed successfully.
... INFO: Amending file to make it suitable for Import.
... INFO: Stage 1 - Document filtering

Since it's below 250MB, the parsing isn't multi-threaded.

I'm assuming it's stuck at the sed line:

$sed_cmd ${sed_edit_in_place} 's/.*,"doc"://g'

Could someone explain the purpose of removing .*,"doc":? Is this the Database Compaction or the Purge Historic and Deleted Data logic?

Looking at the JSON file, this sed removes the following part from each line.

{"id":"...","key":"...","value":{"rev":"..."},"doc":

I think a comment above that code would be welcome.

I'm assuming my issue is caused by binary attachments in all the docs.

I don't think #31 helps in my case, since I do want this stripping to happen.

@peteruithoven (Contributor, Author)

Testing this with just that doc, it seems we can make the regular expression more performant by making the start more specific, using s/{"id".*,"doc"://g.
Does that make sense?
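
A rough way to compare the two patterns is to time them against a synthetic long line (the payload below is made up; it just mimics a doc carrying a large base64 attachment):

payload=$(head -c 1000000 /dev/zero | base64 | tr -d '\n')
printf '{"id":"a","key":"a","value":{"rev":"1-x"},"doc":{"_id":"a","data":"%s"}},\n' "$payload" > line.json
time sed 's/.*,"doc"://g' line.json > /dev/null          # current pattern
time sed 's/{"id".*,"doc"://g' line.json > /dev/null     # proposed anchored pattern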

peteruithoven added a commit to peteruithoven/couchdb-dump that referenced this issue Aug 2, 2017
@dalgibbard (Collaborator)

Hey @peteruithoven - the initial raw export JSON from CouchDB contains an encapsulating id/key/rev/doc section for each individual document within the database. To make the documents importable back into CouchDB, we need to strip this off; the stage 2 sed then removes the leftover closing curly brace from this 'wrapping' section, and the stage 3 and 4 seds fix the header and footer of the JSON.
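
For context, a rough sketch of what those stages do (these are paraphrases, not the script's exact commands, and ${file_name} is just a stand-in for the dump file):

$sed_cmd ${sed_edit_in_place} 's/.*,"doc"://g' ${file_name}   # stage 1: strip the id/key/rev wrapper
$sed_cmd ${sed_edit_in_place} 's/}},$/},/' ${file_name}       # stage 2: drop the wrapper's leftover closing brace
# stages 3 and 4 then rewrite the _all_docs header/footer into the {"docs":[...]} shape that _bulk_docs expects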
I think your proposed suggestion to make the sed statement more specific is valid and looks sane.

I don't have a means to test it right now - would you be able to confirm that the exported file is importable again with this change? And for the record, any details on the speed improvement/time reduction as a result?

@dalgibbard dalgibbard self-assigned this Aug 2, 2017
@dalgibbard dalgibbard added the bug label Aug 2, 2017
@peteruithoven (Contributor, Author)

Thanks for the clarifications.

I've exported all my databases with the altered script, removed them and then reimported them. I haven't found any issues so far.

Regarding speed: I let the old version run for more than 10 minutes on that 39MB file with no progress; I haven't seen it finish at all. With my alteration, it takes maybe a few seconds.

dalgibbard pushed a commit that referenced this issue Aug 3, 2017
* Optimized stage 1 reg-exp

See:  #68

* Version bump to 1.1.8
@dalgibbard (Collaborator)

Merged and closed; thanks!

@peteruithoven (Contributor, Author)

Thanks for checking and merging

@epos-eu (Collaborator)

epos-eu commented Aug 3, 2017 via email
