Help implementing 'parallel' writing to a RealizationSink with BiocParallel #20
The challenge with the in-memory representations is that the memory is being modified in the memory space of each thread / process; that memory is NOT shared by the manager process. I don't think there's an easy way around that. A 'feature' for BiocParallel might be shared memory management, but that would be a big feature, and it would probably require a custom 'realization back end' to work with it. Also, the ipc stuff is inter-process (i.e., on the same computer); some back-ends (notably BatchtoolsParam()) can work on clusters or clouds where the ipc lock doesn't help.
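To make this concrete, here is a minimal sketch (not from the original thread) of why in-memory sinks fail under a multi-process backend: each worker receives a *copy* of any in-memory object, so in-place modifications never reach the manager's copy.

```r
library(BiocParallel)

env <- new.env()
env$x <- 0L
invisible(bplapply(1:4, function(i, env) {
    env$x <- env$x + i   # modifies the worker's copy only
    env$x
}, env = env, BPPARAM = SnowParam(2)))
env$x
#> [1] 0
```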
Thanks, Martin, that's very helpful for my current use case in bsseq. I was naïvely hoping there would be an easy way around it. What I think I'll do in the short term is limit the 'realization back end' to 'on-disk' backends and the 'parallelisation back end' to 'same computer' backends. If the user requests an in-memory representation of the data, then I'll assume they've got enough RAM to pass around lists of vectors that I'll eventually combine into a matrix in the main process.
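A sketch of that in-memory fallback (the combine step shown, do.call(cbind, ...), is an illustrative assumption; bsseq may do it differently):

```r
library(BiocParallel)

## Each worker returns one column's worth of processed data.
cols <- bplapply(1:10, function(j) {
    rnorm(100)           # stand-in for reading/processing one sample
}, BPPARAM = SnowParam(2))

## Combine the per-worker vectors into a matrix in the main process.
mat <- do.call(cbind, cols)
dim(mat)
#> [1] 100  10
```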
Hi Pete, Where do we stand on this? Is there something that we should implement in DelayedArray to facilitate parallel writing to a RealizationSink? My feeling is that supporting parallel writing to an arbitrary RealizationSink is a hard problem. For on-disk backends that don't natively support concurrent writing, a simple workaround is to split the array to be written into column- or row-oriented blocks, then write the blocks in parallel by using one RealizationSink per block, and finally combine the on-disk blocks with a delayed cbind() or rbind(). We could introduce a new class (e.g. SplitRealizationSink or ParallelRealizationSink) to abstract the above strategy. H.
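A hedged sketch of that 'one RealizationSink per block' workaround (the write_block()/RegularArrayGrid() calls are from more recent DelayedArray versions than the thread's vintage, and the block sizes are arbitrary):

```r
library(DelayedArray)
library(HDF5Array)
library(BiocParallel)

nr <- 100L
blocks <- bplapply(1:4, function(b, nr) {
    library(HDF5Array)
    ## Each worker realizes its own 5 columns in its own HDF5 dataset.
    sink <- HDF5RealizationSink(dim = c(nr, 5L), type = "double")
    viewport <- RegularArrayGrid(dim(sink))[[1L]]  # whole-sink viewport
    write_block(sink, viewport, matrix(rnorm(nr * 5L), ncol = 5L))
    close(sink)
    as(sink, "DelayedArray")
}, nr = nr, BPPARAM = SnowParam(2))

## Combine the on-disk blocks with a delayed cbind() ...
combined <- do.call(cbind, blocks)
## ... and optionally re-realize the result as a single dataset.
final <- writeHDF5Array(combined)
```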
Hi Hervé, It's been a while since I thought about this. The model in my head involves workers that read input files (e.g., BAM files, CSV files, etc.) and process them (e.g., filtering, selection, aggregation, etc.), and then either:

1. write their chunk of processed data directly to a single, shared RealizationSink as they go; or
2. return the processed data to the main process, which writes it to the RealizationSink.
I think it's fair to say (1) would be preferable. The 'one RealizationSink per block' approach is obviously viable, but I hoped/envisioned that it would be a single, shared RealizationSink, mostly because I'd like to avoid having to do the extra (and in my case, expensive) step of combining and re-saving the per-block data. With some data types (like the whole-genome bisulfite-sequencing data handled by bsseq), this initial data ingestion/processing can be quite expensive (as is the writing to disk of these large datasets), so you really only want to do this once and then be able to load the saved object for downstream analyses.
I'm trying to implement writing to an arbitrary RealizationSink backend via `BiocParallel::bplapply()` with an arbitrary BiocParallelParam backend. That is, I want to be able to construct blocks of output, possibly in parallel, and then safely write each block to a HDF5Array, RleArray, etc. (controllable by the user) after each block is generated. I've managed to get this set up for an HDF5RealizationSink (my main use case) by using the inter-process locks available in BiocParallel to ensure the writes are safe (@mtmorgan am I using these correctly?). But I've not had any luck using the same code for an arrayRealizationSink. Nor can I get it to work for the RleArray backend, although here the problem seems to be slightly different.
Below is a toy example that tries to construct a DelayedMatrix column-by-column using this strategy, in conjunction with different DelayedArray backends:
Created on 2018-05-21 by the reprex package (v0.2.0).
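The reprex itself is not reproduced above. As a stand-in, here is a minimal sketch of the strategy for the HDF5 backend (the one case that worked), assuming the write_block()/RegularArrayGrid() API from more recent DelayedArray versions plus BiocParallel's ipc* locks:

```r
library(DelayedArray)
library(HDF5Array)
library(BiocParallel)

nr <- 100L
nc <- 10L
sink <- HDF5RealizationSink(dim = c(nr, nc), type = "double")
grid <- RegularArrayGrid(dim(sink), spacings = c(nr, 1L))  # one viewport per column
lock <- ipcid()                                            # shared inter-process lock

invisible(bplapply(seq_len(nc), function(j, sink, grid, lock) {
    library(HDF5Array)
    library(BiocParallel)
    col <- matrix(rnorm(dim(sink)[1L]), ncol = 1L)  # simulate computing one column
    ipclock(lock)                                   # one writer at a time
    write_block(sink, grid[[j]], col)
    ipcunlock(lock)
    NULL
}, sink = sink, grid = grid, lock = lock, BPPARAM = SnowParam(2)))

ipcremove(lock)
close(sink)
as(sink, "DelayedArray")
```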