Replies: 4 comments
-
This is kind of intended behavior, or at least a consequence of the general architecture. The extraction logic is not responsible for managing the database connection and neither are the feature modules themselves. The database management logic, on the other hand, is not responsible for anything to do with the extraction, other than providing the required access to the specified database (if possible). It could be argued that the extraction logic should check (regularly?) if the database connection is still available, but then the question is what the optimal behavior would be in case it isn't.
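For illustration only, such a periodic availability check could look something like the sketch below. `ConnectionWatchdog` and the `isConnectionAlive` probe are hypothetical names, not part of the code base; the open question of how the extraction should react is left to the caller, which simply polls the flag.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch of the "check regularly" idea: a scheduled probe that
 * records whether the database connection is still alive. The probe itself
 * (isConnectionAlive) stands in for whatever the database layer could expose.
 */
public final class ConnectionWatchdog implements AutoCloseable {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile boolean healthy = true;

    public ConnectionWatchdog(BooleanSupplier isConnectionAlive, long intervalSeconds) {
        this.scheduler.scheduleAtFixedRate(
                () -> this.healthy = isConnectionAlive.getAsBoolean(),
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    /** The extraction loop can poll this flag and decide how to react. */
    public boolean isHealthy() {
        return this.healthy;
    }

    @Override
    public void close() {
        this.scheduler.shutdownNow();
    }
}
```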
-
Am I correct in my understanding that the database is essentially an index that the extraction command is responsible for building (among other things)? Is the current approach that information is buffered until the database becomes available again, or is it just dropped on the floor?

From my own perspective a "fail-fast" approach would be better. Say I'm doing some indexing on a computing cluster, but the database is not accessible. I start my batch job, come back to it a few days later, and discover by trawling through the logs that the database was never connected. I have used a bunch of computing resources and produced a bunch of output, but because there are no indices in the database, I suppose none of it can be reused and we have to start from scratch. In that case computing resources have been spent on essentially useless work (I think?).

The ideal behaviour would be for everything to be atomic: if the database disappeared during the extraction process, then whatever data was associated with entities that didn't get fully indexed in the database would also be rolled back, and the extraction process would terminate early (maybe even saying "OK, I managed to get 20% through" or something). The next best thing is an extraction process that makes a loud noise (by dying) when the database becomes unavailable (including, or especially, when a connection could never be established in the first place) but possibly leaves a semi-inconsistent state. Ideally the user would then be told whether they should truncate their database before starting the extraction again.

Would it not be possible to just raise an UncheckedIOException or similar, perhaps making sure that this can only be raised when running an extraction process, and not when running as a server (in which case we should probably just wait and hope for the database to come back up)?
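To make the suggestion concrete, here is a minimal sketch of the proposed mode-dependent behaviour, built around a hypothetical `FailFastWriter` wrapper (none of these names exist in the code base): during an extraction a failed write aborts the run with an `UncheckedIOException`, while in server mode the failure is merely reported so the caller can wait and retry.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * Hypothetical sketch of the fail-fast behaviour described above. It only
 * illustrates the idea of failing loudly during extraction while tolerating
 * outages in server mode; it is not the project's actual persistence layer.
 */
public final class FailFastWriter {

    /** Distinguishes a one-shot extraction run from a long-running server. */
    public enum Mode { EXTRACTION, SERVER }

    private final Mode mode;

    public FailFastWriter(Mode mode) {
        this.mode = mode;
    }

    /**
     * Attempts a write. In extraction mode a lost connection terminates the
     * run immediately; in server mode the method returns false so the caller
     * can retry once the database comes back up.
     */
    public boolean persist(Runnable write) {
        try {
            write.run();
            return true;
        } catch (RuntimeException e) {
            if (this.mode == Mode.EXTRACTION) {
                // Fail fast: abort the batch job instead of silently dropping data.
                throw new UncheckedIOException(
                        new IOException("Database unavailable during extraction", e));
            }
            // Server mode: signal failure and let the caller wait and retry.
            return false;
        }
    }
}
```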
-
The database is used to store all the feature- and meta-data which is generated during the extraction. The extraction process can also produce other data which is not stored in the database, for example thumbnail images for video sequences. Running extraction on a cluster with many jobs in parallel and writing to a database is not recommended, as the database could become a bottleneck. In these cases, it would be advisable to change the 'writer' configuration to store all the extracted information in local files instead and transfer these files after the extraction process is finished.
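As a rough illustration of the file-based alternative (the class below and its record format are made up for this sketch and do not reflect the project's actual writer API), such a writer could simply append one JSON line per feature row to a local file, which is then imported into the database after the cluster jobs have finished:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Minimal sketch of a file-backed writer: instead of inserting each feature
 * row into the database, append it to a local file that can be imported later.
 * Class name and record format are illustrative only.
 */
public final class LocalFileFeatureWriter {

    private final Path target;

    public LocalFileFeatureWriter(Path target) {
        this.target = target;
    }

    /** Appends one record as a single JSON line (segment id plus feature vector). */
    public void write(String segmentId, float[] feature) {
        StringBuilder line = new StringBuilder("{\"id\":\"").append(segmentId).append("\",\"feature\":[");
        for (int i = 0; i < feature.length; i++) {
            if (i > 0) line.append(',');
            line.append(feature[i]);
        }
        line.append("]}").append(System.lineSeparator());
        try {
            Files.writeString(this.target, line, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```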
-
Okay, I will use the JSON writer, although I wonder if this simply kicks the problem down the road, since presumably the slow insertion still needs to be performed after the extraction. On the other hand, I suppose we can at least reuse our JSONs when upgrading to a version of Cottontail with a new DB format. We should document that the JSON writer should be used on HPC systems (see #103).

I will leave this issue open, since I believe the original issue still stands as a papercut for new users. A 90% solution could just check if the database is available at the beginning and die immediately if not, and merely warn if the database connection disappears part way through an extraction. The CLI could also warn that extracting directly to a database is not the preferred/most supported configuration.
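A minimal sketch of that 90% solution, assuming a plain TCP probe against whatever host and port the configuration specifies (all names and values below are placeholders):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/**
 * Sketch of the "check at the beginning and die immediately" idea: probe the
 * database endpoint once before the extraction starts and abort if it cannot
 * be reached. Host, port and timeout are placeholders for configured values.
 */
public final class StartupCheck {

    /** Returns true if a TCP connection to the endpoint succeeds within the timeout. */
    static boolean databaseReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (!databaseReachable("127.0.0.1", 1865, 2_000)) {
            System.err.println("Database is not reachable, aborting extraction before any work is done.");
            System.exit(1);
        }
        // ... start the extraction only once the connection check has passed ...
    }
}
```

Failing here means a cluster job dies within seconds instead of spending days producing output that was never indexed.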
-
On the development branch, extraction continues even when the database is unavailable. I'm not sure, but I think ideally the extraction process should just abort in this case to avoid outputting files which are not indexed, and the same should happen if the database becomes unreachable part way through. This way a partial dump can (hopefully) be restored.
Here are some illustrative logs: