
Gpu encoder no blob batching #627

Closed

Conversation

Contributor

@bxue-l2 bxue-l2 commented Jul 7, 2024

Why are these changes needed?

This PR adds GPU support for the encoder. Latency improves by roughly 33x, and cost by roughly 15x.

In the implementation, the PR separates the key functions out of the parameterized_prover by adding new interfaces:

  • ProofComputeDevice for the KZG prover

  • RsComputeDevice for the RSEncoder

The PR contains both CPU and GPU implementations of these devices, so an encoder can choose either one depending on the available hardware. (Note: many functions in the GPU implementation of ProofComputeDevice still rely on the CPU.)
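A minimal sketch of what this device split could look like (the two interface names come from this PR; the method signatures, types, and package name are illustrative assumptions, not the actual API):

```go
package encoder

import (
	"github.com/consensys/gnark-crypto/ecc/bn254"
	"github.com/consensys/gnark-crypto/ecc/bn254/fr"
)

// RsComputeDevice abstracts the Reed-Solomon extension step so that a CPU
// or GPU backend can sit behind the same RSEncoder.
type RsComputeDevice interface {
	// ExtendPolyEval evaluates the polynomial over the extended domain.
	ExtendPolyEval(coeffs []fr.Element) ([]fr.Element, error)
}

// ProofComputeDevice abstracts the KZG-side computations for the prover.
// (Hypothetical method set; the real interface in the PR may differ.)
type ProofComputeDevice interface {
	ComputeCommitment(coeffs []fr.Element) (*bn254.G1Affine, error)
	ComputeMultiFrameProof(coeffs []fr.Element) ([]bn254.G1Affine, error)
}
```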

Further, this PR optimizes encoding performance by running non-interfering tasks in parallel, namely

  • lengthproof, lengthcommit, kzgCommit, proofs, and chunks (see the fan-out/fan-in sketch below)
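
A runnable toy sketch of that fan-out/fan-in pattern (the channel names mirror the PR's diff; the payloads and workers are placeholders, not the real computations):

```go
package main

import "fmt"

// taskResult carries a worker's output plus its error back to the collector.
type taskResult struct {
	name string
	err  error
}

func main() {
	// One buffered channel per independent task, as in the PR's diff.
	lengthProofChan := make(chan taskResult, 1)
	commitmentChan := make(chan taskResult, 1)
	rsChan := make(chan taskResult, 1)

	// Fan out: each non-interfering task runs in its own goroutine.
	go func() { lengthProofChan <- taskResult{"lengthProof", nil} }()
	go func() { commitmentChan <- taskResult{"kzgCommit", nil} }()
	go func() { rsChan <- taskResult{"chunks", nil} }()

	// Fan in: wait for every task, then inspect the errors together.
	for _, r := range []taskResult{<-lengthProofChan, <-commitmentChan, <-rsChan} {
		if r.err != nil {
			fmt.Println("encoding failed:", r.err)
			return
		}
	}
	fmt.Println("all tasks done")
}
```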

The GPU support is made possible by Ingonyama's GPU acceleration technology. A GPU performs best when computation is batched. Within a single blob there is already plenty of batching, but batching across blobs would give additional benefits.

This PR has not yet been tested against the end-to-end test; the report and future work can be found in the link.

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

Contributor

@samlaf samlaf left a comment


First round of review. I don't understand most of the code here so only left a bunch of style/nit comments. Still getting familiar with the codebase, will need to ask you some questions.

```go
)

func SetupMsm(rowsG1 [][]bn254.G1Affine) []bn254_icicle.Affine {
	rowsG1Icicle := make([]bn254_icicle.Affine, 0)
```

Suggested change:

```diff
-rowsG1Icicle := make([]bn254_icicle.Affine, 0)
+var capacity int
+for _, row := range rowsG1 {
+	capacity += len(row)
+}
+rowsG1Icicle := make([]bn254_icicle.Affine, 0, capacity)
```


Not sure how large your rowsG1 object is, but preallocating the slice's capacity avoids a reallocation and copy every time the length crosses a power of 2.

This suggested code might not be accurate if bn254_icicle.Affine structs are a different size than bn254.G1Affine structs.
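A throwaway benchmark sketch of the difference, runnable as a _test.go file (the element type and count are toy assumptions):

```go
package slicebench

import "testing"

type affine [2]uint64 // toy stand-in for bn254_icicle.Affine

const numPoints = 1 << 16

// Without a capacity hint, append reallocates and copies each time the
// backing array fills up (roughly at every power of 2).
func BenchmarkAppendNoCap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := make([]affine, 0)
		for j := 0; j < numPoints; j++ {
			s = append(s, affine{})
		}
		_ = s
	}
}

// With the total capacity computed up front, append never reallocates.
func BenchmarkAppendWithCap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := make([]affine, 0, numPoints)
		for j := 0; j < numPoints; j++ {
			s = append(s, affine{})
		}
		_ = s
	}
}
```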


```go
// batchSize will change later when used
cfg.BatchSize = int32(1)
cfg.NttAlgorithm = core.Radix2
```

Suggested change:

```diff
-cfg.NttAlgorithm = core.Radix2
```


already set above on line 15

Comment on lines +14 to +18
```go
cfg.Ordering = core.KNN
cfg.NttAlgorithm = core.Radix2

// batchSize will change later when used
cfg.BatchSize = int32(1)
```

Why are you resetting these params here? In case they change their default config? KNN and a batch size of 1 already seem to be the defaults in https://github.com/ingonyama-zk/icicle/blob/7fd9ed1b4918246e688c168eb0339b4c34876425/wrappers/golang/core/ntt.go#L56,

but the default algorithm is Auto (radix-2 or mixed-radix). Why explicitly set Radix2 and not mixed-radix? If you have a good reason, it would be good to document it here.
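
If the deviation is deliberate, one option is to start from the library defaults and override only that one field, with the reason next to it. A sketch, assuming a default-config helper like the one in the linked wrapper (check the exact name against the pinned icicle version):

```go
// Defaults per the linked core/ntt.go: KNN ordering, batch size 1, Auto algorithm.
cfg := bn254_icicle_ntt.GetDefaultNttConfig()
// Intentional override of the Auto default; document the benchmark behind it,
// e.g. "Radix2 measured faster than MixedRadix for our 2^k input sizes".
cfg.NttAlgorithm = core.Radix2
```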

Comment on lines +26 to +28
```go
rouIcicle := bn254_icicle.ScalarField{}
limbs := core.ConvertUint64ArrToUint32Arr(rou[:])
rouIcicle.FromLimbs(limbs)
```

Suggested change:

```diff
-rouIcicle := bn254_icicle.ScalarField{}
-limbs := core.ConvertUint64ArrToUint32Arr(rou[:])
-rouIcicle.FromLimbs(limbs)
+limbs := core.ConvertUint64ArrToUint32Arr(rou[:])
+rouIcicle := bn254_icicle.ScalarField{}
+rouIcicle.FromLimbs(limbs)
```

styling preference

```go
rouIcicle := bn254_icicle.ScalarField{}
limbs := core.ConvertUint64ArrToUint32Arr(rou[:])
rouIcicle.FromLimbs(limbs)
bn254_icicle_ntt.InitDomain(rouIcicle, cfg.Ctx, false)
```

what does this do?


```go
type CpuComputer struct {
	Fs *fft.FFTSettings
	encoding.EncodingParams
```

Why is the encoding here? You don't seem to use it in ExtendPolyEval.

Comment on lines +130 to +138
```diff
+// Encoding Reed Solomon using FFT
+func (g *Encoder) ExtendPolyEval(coeffs []fr.Element) ([]fr.Element, error) {
+
+	evals, err := g.Fs.FFT(coeffs, false)
+	if err != nil {
-		return nil, nil, err
+		return nil, err
+	}
+
-	return evals, pdCoeffs, nil
+	return evals, nil
```

Think you can remove this method here since it's been moved to {cpu,gpu}/extend_poly.go?

Comment on lines +25 to +26
```go
g.GpuLock.Lock()
defer g.GpuLock.Unlock()
```

why is this needed? Are there going to be multiple encoders sharing a GPU?


oh is it because both the rs encoder and the prover share the gpu? You might want to add that as a comment.
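
For example, something like (wording assumed):

```go
// GpuLock serializes access to the GPU: the RS encoder and the KZG prover
// share a single device, so their kernels must not run concurrently.
g.GpuLock.Lock()
defer g.GpuLock.Unlock()
```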

```go
g.GpuLock.Lock()
defer g.GpuLock.Unlock()

scalarsSF := gpu_utils.ConvertFrToScalarFieldsBytes(coeffs)
```

Is it important that coeffs is not tampered with? Otherwise you could potentially convert it in place to save memory. Not sure how much of an improvement it would be, but it could be interesting to benchmark. You're doing a lot of memory allocations + copies everywhere.
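
A toy sketch of the in-place idea (stand-in 32-byte types, since the real fr.Element and bn254_icicle.ScalarField use different limb encodings; reusing the buffer clobbers coeffs, so it is only safe if the caller no longer needs it):

```go
package main

import (
	"fmt"
	"unsafe"
)

// Toy stand-ins for two same-size (32-byte) limb representations.
type frElement [4]uint64
type scalarField [8]uint32

// toScalarInPlace reinterprets coeffs' backing array instead of allocating
// a second slice; coeffs must not be used afterwards.
func toScalarInPlace(coeffs []frElement) []scalarField {
	out := unsafe.Slice((*scalarField)(unsafe.Pointer(&coeffs[0])), len(coeffs))
	for i := range out {
		// Real code would rewrite out[i]'s limbs here (e.g. converting out
		// of Montgomery form); these toy types already share a layout.
		_ = out[i]
	}
	return out
}

func main() {
	coeffs := []frElement{{1, 2, 3, 4}}
	fmt.Println(toScalarInPlace(coeffs))
}
```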

Comment on lines +149 to +158
```diff
+	lengthProofResult := <-lengthProofChan
+	lengthCommitmentResult := <-lengthCommitmentChan
+	commitmentResult := <-commitmentChan
+	rsResult := <-rsChan
+	proofsResult := <-proofChan

-	// outputs is out of order - buttefly
-	proofs, err := p.Fs.FFTG1(sumVecInv[:dimE], false)
-	if err != nil {
-		return nil, err
+	if lengthProofResult.Err != nil || lengthCommitmentResult.Err != nil ||
+		commitmentResult.Err != nil || rsResult.Err != nil ||
+		proofsResult.Err != nil {
+		return nil, nil, nil, nil, nil, multierror.Append(lengthProofResult.Err, lengthCommitmentResult.Err, commitmentResult.Err, rsResult.Err, proofsResult.Err)
```

I think using https://pkg.go.dev/golang.org/x/sync/errgroup could be very idiomatic here.
I've never used errgroup myself, but the idea is that if one of the computations errors early, your current code will still wait for everything to finish and then return the joint error.
An errgroup would cancel the shared context as soon as one goroutine errors out and return that first error. Unless you really want to know all the errors that happened (and not just the first one), this could be useful to save resources.
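
A toy sketch of the errgroup version (the task functions are hypothetical stand-ins for the PR's workers):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Hypothetical stand-ins for the PR's parallel tasks.
func computeLengthProof(ctx context.Context) error { return nil }
func computeCommitment(ctx context.Context) error  { return nil }

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	// Each task becomes one goroutine. The first error cancels ctx, so
	// ctx-aware siblings can stop early instead of running to completion.
	g.Go(func() error { return computeLengthProof(ctx) })
	g.Go(func() error { return computeCommitment(ctx) })

	// Wait blocks until all goroutines return, and yields only the first
	// non-nil error (unlike multierror.Append, which keeps all of them).
	if err := g.Wait(); err != nil {
		fmt.Println("encoding failed:", err)
	}
}
```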

@bxue-l2 bxue-l2 changed the base branch from master to gpu-encode July 15, 2024 20:03
@bxue-l2 bxue-l2 deleted the branch Layr-Labs:gpu-encode July 16, 2024 03:22
@bxue-l2 bxue-l2 closed this Jul 16, 2024