Replies: 7 comments
-
How many feature modules are you using during the extraction? If you are only using a small number, the decoding and 'house-keeping' overhead might be larger than the extraction workload, so you might not get high resource utilization independent of the queue size. Since cineast cannot process multiple videos simultaneously, starting multiple instances in parallel would be your easiest option here. You should be able to set the output directory for the JSON output in the job file, so avoiding conflicts when writing files should also be possible.
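A rough sketch of how that could be scripted: generate one job file per video with its own output directory, then launch one instance per job. Everything here is an assumption (the job-file field names, the output-directory setting, and the CLI invocation are placeholders), so adapt it to the example configuration you are actually using:

```python
import json
import subprocess
from pathlib import Path

# Placeholders -- the field names ("input", "database", ...) and the CLI
# invocation are assumptions, not the real schema. Adapt to your job template.
VIDEOS = sorted(Path("videos").glob("*.mp4"))
TEMPLATE = json.loads(Path("example_job.json").read_text())
CINEAST_CMD = ["java", "-jar", "cineast.jar", "extract"]  # assumed invocation

procs = []
for video in VIDEOS:
    job = dict(TEMPLATE)
    # One video and one private output directory per instance, so the JSON
    # writers of parallel instances never write to the same files.
    out_dir = Path("out") / video.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    job["input"] = {"path": str(video)}                             # assumed field
    job["database"] = {"writer": "JSON", "location": str(out_dir)}  # assumed field

    job_file = out_dir / "job.json"
    job_file.write_text(json.dumps(job, indent=2))
    procs.append(subprocess.Popen(CINEAST_CMD + [str(job_file)]))

for p in procs:
    p.wait()
```

In practice you would cap how many instances run at once (a process pool, Snakemake, or similar) rather than launching one per video.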
-
I am just using the example configuration:
I will move on to arranging my own process-level parallelism for now.
-
I noticed that the REST API provides the possibility of submitting extraction jobs. It seems there was also a project, https://github.com/vitrivr/cthulhu, which I assume used this API. Could you please tell me a bit about the status of this?
-
Cthulhu and the extraction API were two independent paths towards a somewhat related goal, neither of which gained a lot of traction. The idea behind both was to have a means for continuous extraction: one or several vitrivr instances (or machines on which instances could be run) sitting somewhere, waiting for input, and an external mechanism providing that input. Cthulhu was meant more as an orchestration tool to distribute the extraction of large amounts of data in a cluster setting. It doesn't make use of the extraction API, since that API didn't exist at the time. The primary problem was that, while the approach worked reasonably well in a grid-like setting, where you have a bunch of machines which are always there and whose resources you can partially use, it didn't work all that well in a job-cluster setting. The extraction API came later, to enable continuous extension of an indexed collection of media items. We used it for some demos, where an external tool would crawl the web for recent images matching some predefined properties and feed whatever it found to a vitrivr instance running on the same machine. Beyond such demo applications, we never did a lot with it.
-
Okay, thanks for the background. Edited to add: this seems like it might be a good profiler for picking up problems with lock contention: https://github.com/jvm-profiling-tools/async-profiler. I've had good experiences with profilers that work this way (halt the process with a signal at random intervals, then inspect memory dumps to recover information about program state) for Python -- they give quite accurate/actionable data.
-
I didn't benchmark, but I suspect part of it might be these locks: lots of features use one of these computed properties of the segment, and they might all have to wait for these results. I don't have an easy fix, but when there's this type of data-dependency structure, pipelining can often help.
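To sketch what I mean by pipelining (illustrative Python only, not Cineast's actual code): a dedicated stage computes the shared segment properties exactly once, and only then is the segment handed to the feature extractors, so no feature thread ever blocks on a lazily computed value.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch only; names do not correspond to Cineast classes.

def precompute_shared(segment):
    """Stage 1: derive the shared per-segment properties exactly once."""
    frames = segment["frames"]
    segment["avg_frame"] = sum(frames) / len(frames)  # stand-in for the average frame
    segment["representative"] = max(frames)           # stand-in for the representative frame
    return segment

def feature_a(segment):
    return segment["avg_frame"] * 2        # reads a precomputed value, never blocks

def feature_b(segment):
    return segment["representative"] + 1   # reads a precomputed value, never blocks

def run_pipeline(segments, features, workers=8):
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stage 1 runs per segment; stage 2 fans each segment out to all features.
        for segment in pool.map(precompute_shared, segments):
            futures.extend(pool.submit(f, segment) for f in features)
    return [f.result() for f in futures]

if __name__ == "__main__":
    segments = [{"frames": [1.0, 2.0, 3.0]}, {"frames": [4.0, 5.0, 6.0]}]
    print(run_pipeline(segments, [feature_a, feature_b]))
```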
-
These locks are there to prevent having to perform this computation multiple times, so the alternative to waiting would be redundant computation in this case.
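For reference, the pattern in question is roughly a lock-guarded compute-once getter, something like this sketch (illustrative only, not the actual implementation):

```python
import threading

class Segment:
    """Sketch of a compute-once property guarded by a lock (not Cineast code)."""

    def __init__(self, frames):
        self.frames = frames
        self._avg = None
        self._lock = threading.Lock()

    def average_frame(self):
        # All feature threads call this; the lock guarantees the expensive
        # computation runs only once, at the cost of the other callers waiting.
        with self._lock:
            if self._avg is None:
                self._avg = sum(self.frames) / len(self.frames)  # stand-in for the real work
            return self._avg
```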
-
When performing an extraction on a machine with 40 cores and the following settings:
Most of the resources are not used. This is using the JSON writer, as advised for HPC extraction in the other issue.
About 70% of the time it behaves as if it were single-threaded.
About 30% of the time it uses slightly more, but not close to the full 40 cores.
Since this is essentially an embarrassingly parallel problem (each video is independent), we should be able to improve upon this, which would help when performing extraction on large collections.
An easy way of fixing this would be "external" parallelism, where many extraction processes are started with Snakemake or similar (I have been using this together with slurm like so: https://github.com/frankier/singslurm) and each extraction process deals with only a single file before finishing, which would also allow extraction across multiple nodes. The resulting JSONs could then be merged later on. I think this is already almost possible by templating the file name into the extraction job config, but in that case the paths in the metadata JSONs will need to be fixed up. Or is this what the skip and limit parameters are intended for? Should I template the job file to start many jobs that each deal with e.g. 10 videos at a time? If so, do these numbers count all files, or just mp4 files? If this is the recommended mechanism for HPC extractions, we should definitely document it.
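For the merge step, something like the following might do, assuming each per-video run wrote its output to out/<video>/<table>.json and that each such file contains a JSON array of rows (I'd verify against what the JSON writer actually produces before relying on this):

```python
import json
from collections import defaultdict
from pathlib import Path

# Assumption: out/<video>/<table>.json each holds a JSON array of rows.
merged = defaultdict(list)
for table_file in Path("out").glob("*/*.json"):
    if table_file.name == "job.json":
        continue  # skip per-run job configs, keep only the table outputs
    merged[table_file.name].extend(json.loads(table_file.read_text()))

merged_dir = Path("merged")
merged_dir.mkdir(exist_ok=True)
for name, rows in merged.items():
    (merged_dir / name).write_text(json.dumps(rows))
```

Any path fields in the metadata tables would still need to be fixed up afterwards, as mentioned above.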