Replies: 4 comments
-
This is kind of intended behavior, or at least a consequence of the general architecture. The extraction logic is not responsible for managing the database connection and neither are the feature modules themselves. The database management logic, on the other hand, is not responsible for anything to do with the extraction, other than providing the required access to the specified database (if possible). It could be argued that the extraction logic should check (regularly?) if the database connection is still available, but then the question is what the optimal behavior would be in case it isn't.
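For illustration only, such a periodic availability check could look something like the sketch below. `ConnectionWatchdog` and the `isConnectionAlive` probe are hypothetical names, not part of the code base; the open question of how the extraction should react is left to the caller, which simply polls the flag.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch of the "check regularly" idea: a scheduled probe that
 * records whether the database connection is still alive. The probe itself
 * (isConnectionAlive) stands in for whatever the database layer could expose.
 */
public final class ConnectionWatchdog implements AutoCloseable {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile boolean healthy = true;

    public ConnectionWatchdog(BooleanSupplier isConnectionAlive, long intervalSeconds) {
        this.scheduler.scheduleAtFixedRate(
                () -> this.healthy = isConnectionAlive.getAsBoolean(),
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    /** The extraction loop can poll this flag and decide how to react. */
    public boolean isHealthy() {
        return this.healthy;
    }

    @Override
    public void close() {
        this.scheduler.shutdownNow();
    }
}
```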
-
Am I correct in my understanding that the database is essentially an index that the extraction command is responsible for building (among other things)? Is the current approach that information is buffered until the database becomes available again, or is it just dropped on the floor?

From my own perspective a "fail-fast" approach would be better. Say I'm doing some indexing on a computing cluster, but the database is not accessible. I start my batch job, come back to it a few days later, and discover by trawling through the logs that the database was never connected. I have used a bunch of computing resources and produced a bunch of output, but because there are no indices in the database, I suppose none of it can be reused and we have to start from scratch. In that case computing resources have been spent on essentially useless work (I think?).

The ideal behaviour would be for everything to be atomic: if the database disappeared during the extraction process, then whatever data was associated with entities that didn't get fully indexed in the database would also be rolled back, and the extraction process would terminate early (maybe even saying "OK, I managed to get 20% through" or something). The next best thing is an extraction process that makes a loud noise (by dying) when the database becomes unavailable (including, or especially, when a connection could never be established in the first place) but possibly leaves a semi-inconsistent state. Ideally the user would then be told whether they should truncate their database before starting the extraction again.

Would it not be possible to just raise an UncheckedIOException or similar, perhaps making sure that this can only be raised when running an extraction process, and not when running as a server (in which case we should probably just wait and hope for the database to come back up)?
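To make the suggestion concrete, here is a minimal sketch of the proposed mode-dependent behaviour, built around a hypothetical `FailFastWriter` wrapper (none of these names exist in the code base): during an extraction a failed write aborts the run with an `UncheckedIOException`, while in server mode the failure is merely reported so the caller can wait and retry.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * Hypothetical sketch of the fail-fast behaviour described above. It only
 * illustrates the idea of failing loudly during extraction while tolerating
 * outages in server mode; it is not the project's actual persistence layer.
 */
public final class FailFastWriter {

    /** Distinguishes a one-shot extraction run from a long-running server. */
    public enum Mode { EXTRACTION, SERVER }

    private final Mode mode;

    public FailFastWriter(Mode mode) {
        this.mode = mode;
    }

    /**
     * Attempts a write. In extraction mode a lost connection terminates the
     * run immediately; in server mode the method returns false so the caller
     * can retry once the database comes back up.
     */
    public boolean persist(Runnable write) {
        try {
            write.run();
            return true;
        } catch (RuntimeException e) {
            if (this.mode == Mode.EXTRACTION) {
                // Fail fast: abort the batch job instead of silently dropping data.
                throw new UncheckedIOException(
                        new IOException("Database unavailable during extraction", e));
            }
            // Server mode: signal failure and let the caller wait and retry.
            return false;
        }
    }
}
```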
-
The database is used to store all the feature- and meta-data which is generated during the extraction. The extraction process can also produce other data which is not stored in the database, for example thumbnail images for video sequences. Running extraction on a cluster with many jobs in parallel and writing to a database is not recommended, as the database could become a bottleneck. In these cases, it would be advisable to change the 'writer' configuration to store all the extracted information in local files instead and transfer these files after the extraction process is finished.
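As a rough illustration of the file-based alternative (the class below and its record format are made up for this sketch and do not reflect the project's actual writer API), such a writer could simply append one JSON line per feature row to a local file, which is then imported into the database after the cluster jobs have finished:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Minimal sketch of a file-backed writer: instead of inserting each feature
 * row into the database, append it to a local file that can be imported later.
 * Class name and record format are illustrative only.
 */
public final class LocalFileFeatureWriter {

    private final Path target;

    public LocalFileFeatureWriter(Path target) {
        this.target = target;
    }

    /** Appends one record as a single JSON line (segment id plus feature vector). */
    public void write(String segmentId, float[] feature) {
        StringBuilder line = new StringBuilder("{\"id\":\"").append(segmentId).append("\",\"feature\":[");
        for (int i = 0; i < feature.length; i++) {
            if (i > 0) line.append(',');
            line.append(feature[i]);
        }
        line.append("]}").append(System.lineSeparator());
        try {
            Files.writeString(this.target, line, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```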
-
Okay, I will use the JSON writer, although I wonder if this simply kicks the problem down the road, since presumably the slow insertion still needs to be performed after the extraction. On the other hand, I suppose we can at least reuse our JSONs when upgrading to a version of Cottontail with a new DB format. We should document that the JSON writer should be used on HPC systems (see #103).

I will leave this issue open, since I believe the original issue still stands as a papercut for new users. A 90% solution could just check if the database is available at the beginning and die immediately if not, and merely warn if the database connection disappears part way through an extraction. The CLI could also warn that extracting directly to a database is not the preferred/most supported configuration.
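A minimal sketch of that 90% solution, assuming a plain TCP probe against whatever host and port the configuration specifies (all names and values below are placeholders):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/**
 * Sketch of the "check at the beginning and die immediately" idea: probe the
 * database endpoint once before the extraction starts and abort if it cannot
 * be reached. Host, port and timeout are placeholders for configured values.
 */
public final class StartupCheck {

    /** Returns true if a TCP connection to the endpoint succeeds within the timeout. */
    static boolean databaseReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (!databaseReachable("127.0.0.1", 1865, 2_000)) {
            System.err.println("Database is not reachable, aborting extraction before any work is done.");
            System.exit(1);
        }
        // ... start the extraction only once the connection check has passed ...
    }
}
```

Failing here means a cluster job dies within seconds instead of spending days producing output that was never indexed.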
-
On the development branch, extraction continues even when the database is unavailable. I'm not sure, but I think ideally the extraction process should just abort in this case to avoid outputting files which are not indexed, and the same should happen if the database becomes unreachable part way through. This way a partial dump can (hopefully) be restored.
Here are some illustrative logs: