
Type-preserving alternative to read_block #40

Closed
LTLA opened this issue Feb 8, 2019 · 9 comments
@LTLA (Contributor) commented Feb 8, 2019

This request is mainly motivated by the current DelayedMatrix matrix multiplication code, which does a read_block call to obtain a dense matrix for actual multiplication. This prevents us from taking advantage of underlying features of the matrix representation that might enable faster multiplication with native methods, e.g., dgCMatrix with Matrix::%*%. I guess this touches on some stuff that SparseArraySeed was designed for, but in a more general fashion (e.g., to handle remote matrices with operations executed server-side).

This would be resolved by having a read_block alternative that tries to preserve the underlying nature of the matrix. The easiest way to do this is to allow extract_array to return something other than a dense ordinary matrix, e.g., by passing something like dense=FALSE to extract_array. I know that an arbitrary matrix representation may not support various delayed operations, whereas a dense matrix always will. In such cases, one could use selectMethod() to check whether the delayed operation dispatch is possible for the matrix representation; execute it if it is; and convert it into a dense matrix if it isn't.

Does this sound reasonable, or am I missing something?
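For instance, the selectMethod() check could look something like this. This is just an illustrative sketch, not an existing API: the helper name and the densify fallback are mine, and a real implementation would dispatch on the actual delayed operation rather than a generic name passed as a string.

```r
library(methods)

## Hypothetical sketch: run 'generic_name' on 'x' via its native S4 method
## if one is available, otherwise fall back to a dense ordinary matrix.
apply_or_densify <- function(generic_name, x, ...) {
    if (isGeneric(generic_name) && hasMethod(generic_name, class(x)[1L])) {
        get(generic_name)(x, ...)            # native method, type preserved
    } else {
        get(generic_name)(as.matrix(x), ...) # densify, then use the default
    }
}
```

On a plain base matrix (no S4 method registered), apply_or_densify("rowSums", m) would simply take the fallback branch and agree with rowSums(m).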

@lawremi commented Feb 12, 2019

In principle alternative array representations should support the same operations as base R matrices. There may be corner cases, but I think this is a reasonable proposal.

@hpages (Contributor) commented Oct 22, 2019

@LTLA Hi Aaron,

I don't think it's realistic to imagine that read_block() could preserve the type in general. For example, what would that mean exactly when the object is a cbind() between objects with seeds of different types? Even when the object has a single seed, say a TENxMatrixSeed, I don't see how read_block() could return a TENxMatrixSeed.

The purpose of read_block() is to load a block of data into memory (maybe I should have called the function load_block() instead). Returning the block as a dense matrix was the natural choice: it makes the implementation of block-processed methods like rowSums or %*% easy by keeping them agnostic of the exact type of the underlying seeds. Of course, returning dense blocks when the seed is sparse (and when the delayed operations carried by the object preserve sparsity) is not good. This can be addressed by my proposal below.

Handling remote matrices with operations executed server-side is IMO a different story. The way I see it, it would bypass block processing, so it doesn't have much to do with read_block(). This is the topic of issue #46.

H.

Proposal: What I had in mind for a long time and you can already use for the current DelayedMatrix multiplication code is to support 2 modes of block processing: the dense mode (via read_block()) and the sparse mode (via read_sparse_block()). As far as block processing is concerned, that should cover everything we need for now. Last month I provided some details about this on the big-data channel of the community-bioc slack. I'm copying it here again for reference:

FWIW DelayedArray provides extract_sparse_array(), read_sparse_block() and write_sparse_block(), the sparsity-preserving versions of extract_array(), read_block() and write_block(). Note that extract_sparse_array() and read_sparse_block() return a SparseArraySeed object, which represents a sparse array. Coercion back and forth between 2D SparseArraySeed objects and dgCMatrix is supported.
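As an aside, the dgCMatrix side of that coercion can be illustrated with the Matrix package alone. This is not DelayedArray code — just the dense/sparse round-trip that the SparseArraySeed <-> dgCMatrix coercions rely on, with made-up example data:

```r
library(Matrix)

## Round-trip between a sparse dgCMatrix and a dense ordinary matrix,
## analogous to the SparseArraySeed <-> dgCMatrix coercions mentioned above.
m <- sparseMatrix(i=c(1L, 3L), j=c(2L, 1L), x=c(5, 7), dims=c(3L, 3L))
dense <- as.matrix(m)               # dense ordinary matrix
back <- as(dense, "CsparseMatrix")  # back to a sparse representation
stopifnot(identical(as.matrix(back), dense))
```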

Important: you should only use extract_sparse_array() or read_sparse_block() on objects for which is_sparse() is TRUE. You can still call these functions on an object for which is_sparse() is FALSE, and they might seem to work, but they'll probably silently return something wrong. The reason extract_sparse_array() and read_sparse_block() don't perform the is_sparse() check themselves is that these functions are typically used inside a loop in the context of block processing, so the same check would be performed over and over. This could add significant cost to the overall block processing of the object (the check can be a little expensive, depending on what delayed operations the object carries). So it makes more sense to perform the check upfront, before entering the main block-processing loop.

See for example how is_sparse / read_sparse_block / write_sparse_block / read_block / write_block are used in the BLOCK_write_to_sink() function.

The sparsity-preserving capabilities in DelayedArray/HDF5Array are still a work-in-progress and incomplete (e.g. the DelayedMatrix row/col summarization methods should use them but they don't yet).

However everything you need for a %*% implementation that uses sparse blocks should be here.

@LTLA (Contributor, Author) commented Oct 22, 2019

Well, I guess I don't have a non-sparse example of needing a block in the native structure, so your proposal does cover the use case for now. It seems that it should just be a simple case of splitting the current read_block call into an if/else that calls read_sparse_block as appropriate.

@hpages (Contributor) commented Oct 22, 2019

Something like that. For BLOCK_write_to_sink(), I've put the if/else before entering the block processing top-level loop.

There is also the question of maybe using a grid with bigger blocks when x is sparse. Right now block-processed methods try to honor getAutoBlockSize() as much as possible, but when the blocks are sparse it would make sense to use bigger blocks. The problem is that there is no way to know a priori how much bigger we can make them. For now maybe we could just make them twice or five times bigger (i.e. use blocks of size 5L * getAutoBlockSize()), which is completely arbitrary. Another approach could be to let the user control this via something like setAutoSparseBlockSize()/getAutoSparseBlockSize(). Anyway, this is kind of orthogonal to your problem of implementing a %*% that uses sparse blocks.

@hpages (Contributor) commented Jun 9, 2020

Following up on the painful conversation of PR #64, I'm open to adding an argument to read_block() (e.g. as.sparse) if that makes the switch between "dense block processing" and "sparse block processing" easier. This would allow developers to do:

    x_is_sparse <- is_sparse(x)
    for (bid in seq_len(nblock)) {
        viewport <- grid[[bid]]
        block <- read_block(x, viewport, as.sparse=x_is_sparse)
        ## ... do something with this block
    }

instead of:

    readblock <- if (is_sparse(x)) read_sparse_block else read_block
    for (bid in seq_len(nblock)) {
        viewport <- grid[[bid]]
        block <- readblock(x, viewport)
        ## ... do something with this block
    }

So a light convenience only. With maybe also support for as.sparse=NA which would decide between dense/sparse on a per-block basis.

In any case, I don't want to touch extract_array's contract. It's SACRED! And the matrix multiplication code is not a use case strong enough to change that.

Time to close this one too.

@hpages hpages closed this as completed Jun 9, 2020
@lawremi commented Jun 9, 2020

Thanks for thinking more about this, Herve. The second case would be preferable if there were a function that simply returned the appropriate reader function (readblock in this case). Or, there could be a variant of read_block() that defaults to as.sparse=is_sparse(x).

@hpages (Contributor) commented Jun 9, 2020

Maybe I should clarify the meaning of is_sparse(x). It detects structural sparsity, as opposed to quantitative sparsity. Structural sparsity is a boolean that describes a global property of the DelayedArray object x, so it is the same for all the blocks in x, at least in the current model. So when I say we could support as.sparse=NA, which would decide between dense/sparse on a per-block basis, I'm just trying to be nice, but I have no idea what that means exactly. It's just marketing.

(Maybe it's not too crazy to imagine a backend that uses a mix of sparse/dense chunks depending on the content of each chunk but (1) I've not seen any so far so I wouldn't consider this a solid use case and (2) blocks are typically made of many chunks so even if you have an easy/cheap way to know what the chunks are like there is no clear notion of native representation at the block level.)

So let me reformulate the proposal in a very concrete way:

  • Add the as.sparse argument to read_block().
  • Supported values would be FALSE, TRUE, and NA. If FALSE, the block is returned as an ordinary (dense) array. If TRUE, as a SparseArraySeed object. Using NA would be equivalent to using as.sparse=is_sparse(x). So as.sparse=NA will make the same choice for all blocks and not decide between dense/sparse on a per-block basis as I said earlier.
  • Have it set to FALSE by default (for backward compatibility with existing code).
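In other words, the NA resolution is per-object, not per-block. A tiny sketch of the intended semantics (resolve_as_sparse is a hypothetical helper of mine, not part of the proposed API):

```r
## Hypothetical helper (not part of DelayedArray): resolve the proposed
## as.sparse argument. NA means "follow is_sparse(x)", decided once per object;
## FALSE/TRUE force the dense/sparse path unconditionally.
resolve_as_sparse <- function(as.sparse, x_is_sparse) {
    stopifnot(is.logical(as.sparse), length(as.sparse) == 1L)
    if (is.na(as.sparse)) x_is_sparse else as.sparse
}
```

So resolve_as_sparse(NA, TRUE) gives TRUE, while the backward-compatible default resolve_as_sparse(FALSE, TRUE) gives FALSE, keeping existing code on the dense path regardless of sparsity.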

Would we still need a function that just returned the appropriate reader function? Or a variant of read_block() that defaults to as.sparse=is_sparse(x)? Seems a little bit too much to introduce a new verb just for that.

@hpages (Contributor) commented Jun 10, 2020

One first easy simplification is to get rid of the write_sparse_block() generic and methods (see commit 6a7e5f2). In hindsight introducing this was misguided and an unnecessary complication.

The new as.sparse argument to read_block() is coming next. So with this, the API for reading/writing blocks will be reduced to read_block() and write_block(). No need for the end user to know anything about read_sparse_block() or extract_sparse_array().

@hpages (Contributor) commented Jun 10, 2020

Done (commit c68116a).
