feature request: data recovery tool #82
Comments
Sorry for the slow response. A data recovery tool does seem useful, and I would consider accepting the tool as a contribution. But if there is an issue with BuntDB, I would personally like to fix that problem before adding a tool. You mention:
If the panic is ignored, such that a Go
I'm hoping to reproduce this issue. Is there any way to construct a program that causes this issue using the exotic storage layer?
It was the Gluster Native Client. In principle I think this condition can arise with any networked storage layer:
This commit wraps the errors that could, on rare occasions, cause a panic, allowing for a graceful recovery under select conditions. Please see #82 for more info.
I added a recovery mechanism. Would something like this work?
I think I was hoping for something compatible with errors.Is? Is there an obstacle to that?
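For illustration, here is a minimal sketch of the errors.Is-compatible wrapping being asked about. ErrCorrupt and loadEntry are hypothetical names used only for this example, not part of BuntDB's actual API:

```go
// A minimal sketch of errors.Is-compatible error wrapping.
// ErrCorrupt and loadEntry are hypothetical, not BuntDB API.
package main

import (
	"errors"
	"fmt"
)

// ErrCorrupt is a hypothetical sentinel that a load routine could wrap
// whenever it hits an unparseable section of the append-only file.
var ErrCorrupt = errors.New("buntdb: corrupt database file")

// loadEntry stands in for the internal parsing step; on failure it wraps
// ErrCorrupt with %w so the sentinel survives the wrapping.
func loadEntry(line string) error {
	if line == "" || line[0] != '*' {
		return fmt.Errorf("%w: unexpected token %q", ErrCorrupt, line)
	}
	return nil
}

func main() {
	err := loadEntry("garbage")
	// Callers can branch on the sentinel without string matching,
	// which is what errors.Is compatibility buys over a bare panic.
	if errors.Is(err, ErrCorrupt) {
		fmt.Println("database is corrupt:", err)
	}
}
```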
The Gluster corruption happened again (still no logs AFAIK), so I did a first pass at a recovery tool here:
As the maintainer of Ergo, which uses BuntDB for storage, I've gotten two reports of corrupted databases. It is not clear to me in either case that the root cause of the corruption is a bug in Ergo or BuntDB --- I suspect some combination of operator error and instability in the filesystem/storage layer that BuntDB can't recover from. What's interesting is that in both cases it was possible to repair the corrupted database by hand. Here are the two examples (in both cases, sensitive user data has been redacted, changing the lengths of key and value fields --- these are just illustrations of the type of corruption that occurred.)
The first example looks to me like operator error: some writes appear to be interleaved out of order, with a mysterious blank line interpolated:
The second example looks to me like a failure to truncate(2) in response to a partial write (Ergo will log and ignore this panic by default --- unfortunately the logs are no longer available to confirm this theory). In this case, we know that the application was running on a relatively exotic storage layer. We see a partial write of an incomplete JSON object, followed by a `*3` that tries to set a different key:

Two data points isn't that much to go on, but it seems like there may be a class of data corruption cases that can be partially corrected by skipping over a damaged section of the file. In such cases it would seem desirable to have a tool (a separate executable, probably) that can skip those sections, report on the apparent extent of the damage, and produce a new, valid file. (In both of the cases here, there would have been no actual data loss because the damaged values were subsequently overwritten with new values.)
Thanks very much for your time.