
Gpu encoder no blob batching #627

Closed

Conversation

Contributor

@bxue-l2 bxue-l2 commented Jul 7, 2024

Why are these changes needed?

This PR adds GPU support for the encoder. Latency improves by roughly 33x, and cost by roughly 15x.

In the implementation, the PR separates the key functions out of the parameterized_prover by adding new interfaces:

  • ProofComputeDevice for the KZG prover

  • RsComputeDevice for the RSEncoder

The PR contains both CPU and GPU implementations of these devices, so an encoder can choose either one depending on the available hardware. (Note: many functions in the GPU implementation of ProofComputeDevice still rely on the CPU.)
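A minimal sketch of what this device split could look like (the two interface names come from this PR; the method signatures, types, and package name are illustrative assumptions, not the actual API):

```go
package encoder

import (
	"github.com/consensys/gnark-crypto/ecc/bn254"
	"github.com/consensys/gnark-crypto/ecc/bn254/fr"
)

// RsComputeDevice abstracts the Reed-Solomon extension step so that a CPU
// or GPU backend can sit behind the same RSEncoder.
type RsComputeDevice interface {
	// ExtendPolyEval evaluates the polynomial over the extended domain.
	ExtendPolyEval(coeffs []fr.Element) ([]fr.Element, error)
}

// ProofComputeDevice abstracts the KZG-side computations for the prover.
// (Hypothetical method set; the real interface in the PR may differ.)
type ProofComputeDevice interface {
	ComputeCommitment(coeffs []fr.Element) (*bn254.G1Affine, error)
	ComputeMultiFrameProof(coeffs []fr.Element) ([]bn254.G1Affine, error)
}
```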

Further, this PR optimizes encoding performance by running non-interfering tasks in parallel, namely

  • lengthproof, lengthcommit, kzgCommit, proofs, and chunks (see the fan-out/fan-in sketch below)
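
A runnable toy sketch of that fan-out/fan-in pattern (the channel names mirror the PR's diff; the payloads and workers are placeholders, not the real computations):

```go
package main

import "fmt"

// taskResult carries a worker's output plus its error back to the collector.
type taskResult struct {
	name string
	err  error
}

func main() {
	// One buffered channel per independent task, as in the PR's diff.
	lengthProofChan := make(chan taskResult, 1)
	commitmentChan := make(chan taskResult, 1)
	rsChan := make(chan taskResult, 1)

	// Fan out: each non-interfering task runs in its own goroutine.
	go func() { lengthProofChan <- taskResult{"lengthProof", nil} }()
	go func() { commitmentChan <- taskResult{"kzgCommit", nil} }()
	go func() { rsChan <- taskResult{"chunks", nil} }()

	// Fan in: wait for every task, then inspect the errors together.
	for _, r := range []taskResult{<-lengthProofChan, <-commitmentChan, <-rsChan} {
		if r.err != nil {
			fmt.Println("encoding failed:", r.err)
			return
		}
	}
	fmt.Println("all tasks done")
}
```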

The GPU support is made possible by Ingonyama's GPU acceleration technology. A GPU performs best when computation is batched. Within a single blob there is already plenty of batching, but batching across blobs would give additional benefits.

This PR has not yet been tested against the end-to-end test; the report and future work can be found in the link.

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

Contributor

@samlaf samlaf left a comment


First round of review. I don't understand most of the code here so only left a bunch of style/nit comments. Still getting familiar with the codebase, will need to ask you some questions.

```go
)

func SetupMsm(rowsG1 [][]bn254.G1Affine) []bn254_icicle.Affine {
	rowsG1Icicle := make([]bn254_icicle.Affine, 0)
```

Suggested change:

```diff
-rowsG1Icicle := make([]bn254_icicle.Affine, 0)
+var capacity int
+for _, row := range rowsG1 {
+	capacity += len(row)
+}
+rowsG1Icicle := make([]bn254_icicle.Affine, 0, capacity)
```


Not sure how large your rowsG1 object is, but preallocating the slice's capacity avoids a reallocation and copy every time the length crosses a power of 2.

This suggested code might not be accurate if bn254_icicle.Affine structs are a different size than bn254.G1Affine structs.
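A throwaway benchmark sketch of the difference, runnable as a _test.go file (the element type and count are toy assumptions):

```go
package slicebench

import "testing"

type affine [2]uint64 // toy stand-in for bn254_icicle.Affine

const numPoints = 1 << 16

// Without a capacity hint, append reallocates and copies each time the
// backing array fills up (roughly at every power of 2).
func BenchmarkAppendNoCap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := make([]affine, 0)
		for j := 0; j < numPoints; j++ {
			s = append(s, affine{})
		}
		_ = s
	}
}

// With the total capacity computed up front, append never reallocates.
func BenchmarkAppendWithCap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := make([]affine, 0, numPoints)
		for j := 0; j < numPoints; j++ {
			s = append(s, affine{})
		}
		_ = s
	}
}
```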


```go
// batchSize will change later when used
cfg.BatchSize = int32(1)
cfg.NttAlgorithm = core.Radix2
```

Suggested change:

```diff
-cfg.NttAlgorithm = core.Radix2
```


already set above on line 15

Comment on lines +14 to +18
```go
cfg.Ordering = core.KNN
cfg.NttAlgorithm = core.Radix2

// batchSize will change later when used
cfg.BatchSize = int32(1)
```

Why are you resetting these params here? In case they change their default config? KNN and a batch size of 1 already seem to be the defaults in https://github.com/ingonyama-zk/icicle/blob/7fd9ed1b4918246e688c168eb0339b4c34876425/wrappers/golang/core/ntt.go#L56,

but the default algorithm is Auto (radix-2 or mixed-radix). Why explicitly set Radix2 and not mixed-radix? If you have a good reason, it would be good to document it here.
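
If the deviation is deliberate, one option is to start from the library defaults and override only that one field, with the reason next to it. A sketch, assuming a default-config helper like the one in the linked wrapper (check the exact name against the pinned icicle version):

```go
// Defaults per the linked core/ntt.go: KNN ordering, batch size 1, Auto algorithm.
cfg := bn254_icicle_ntt.GetDefaultNttConfig()
// Intentional override of the Auto default; document the benchmark behind it,
// e.g. "Radix2 measured faster than MixedRadix for our 2^k input sizes".
cfg.NttAlgorithm = core.Radix2
```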

Comment on lines +26 to +28
```go
rouIcicle := bn254_icicle.ScalarField{}
limbs := core.ConvertUint64ArrToUint32Arr(rou[:])
rouIcicle.FromLimbs(limbs)
```

Suggested change:

```diff
-rouIcicle := bn254_icicle.ScalarField{}
-limbs := core.ConvertUint64ArrToUint32Arr(rou[:])
-rouIcicle.FromLimbs(limbs)
+limbs := core.ConvertUint64ArrToUint32Arr(rou[:])
+rouIcicle := bn254_icicle.ScalarField{}
+rouIcicle.FromLimbs(limbs)
```

styling preference

```go
rouIcicle := bn254_icicle.ScalarField{}
limbs := core.ConvertUint64ArrToUint32Arr(rou[:])
rouIcicle.FromLimbs(limbs)
bn254_icicle_ntt.InitDomain(rouIcicle, cfg.Ctx, false)
```

what does this do?


```go
type CpuComputer struct {
	Fs *fft.FFTSettings
	encoding.EncodingParams
```

Why is the encoding here? You don't seem to use it in ExtendPolyEval.

Comment on lines +130 to +138
```diff
+// Encoding Reed Solomon using FFT
+func (g *Encoder) ExtendPolyEval(coeffs []fr.Element) ([]fr.Element, error) {
+
+	evals, err := g.Fs.FFT(coeffs, false)
+	if err != nil {
-		return nil, nil, err
+		return nil, err
+	}
+
-	return evals, pdCoeffs, nil
+	return evals, nil
```

Think you can remove this method here since it's been moved to {cpu,gpu}/extend_poly.go?

Comment on lines +25 to +26
```go
g.GpuLock.Lock()
defer g.GpuLock.Unlock()
```

why is this needed? Are there going to be multiple encoders sharing a GPU?


oh is it because both the rs encoder and the prover share the gpu? You might want to add that as a comment.
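
For example, something like (wording assumed):

```go
// GpuLock serializes access to the GPU: the RS encoder and the KZG prover
// share a single device, so their kernels must not run concurrently.
g.GpuLock.Lock()
defer g.GpuLock.Unlock()
```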

```go
g.GpuLock.Lock()
defer g.GpuLock.Unlock()

scalarsSF := gpu_utils.ConvertFrToScalarFieldsBytes(coeffs)
```

Is it important that coeffs is not tampered with? Otherwise you could potentially convert it in place to save memory. Not sure how much of an improvement it would be, but it could be interesting to benchmark. You're doing a lot of memory allocations + copies everywhere.
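
A toy sketch of the in-place idea (stand-in 32-byte types, since the real fr.Element and bn254_icicle.ScalarField use different limb encodings; reusing the buffer clobbers coeffs, so it is only safe if the caller no longer needs it):

```go
package main

import (
	"fmt"
	"unsafe"
)

// Toy stand-ins for two same-size (32-byte) limb representations.
type frElement [4]uint64
type scalarField [8]uint32

// toScalarInPlace reinterprets coeffs' backing array instead of allocating
// a second slice; coeffs must not be used afterwards.
func toScalarInPlace(coeffs []frElement) []scalarField {
	out := unsafe.Slice((*scalarField)(unsafe.Pointer(&coeffs[0])), len(coeffs))
	for i := range out {
		// Real code would rewrite out[i]'s limbs here (e.g. converting out
		// of Montgomery form); these toy types already share a layout.
		_ = out[i]
	}
	return out
}

func main() {
	coeffs := []frElement{{1, 2, 3, 4}}
	fmt.Println(toScalarInPlace(coeffs))
}
```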

Comment on lines +149 to +158
```diff
+	lengthProofResult := <-lengthProofChan
+	lengthCommitmentResult := <-lengthCommitmentChan
+	commitmentResult := <-commitmentChan
+	rsResult := <-rsChan
+	proofsResult := <-proofChan

-	// outputs is out of order - buttefly
-	proofs, err := p.Fs.FFTG1(sumVecInv[:dimE], false)
-	if err != nil {
-		return nil, err
+	if lengthProofResult.Err != nil || lengthCommitmentResult.Err != nil ||
+		commitmentResult.Err != nil || rsResult.Err != nil ||
+		proofsResult.Err != nil {
+		return nil, nil, nil, nil, nil, multierror.Append(lengthProofResult.Err, lengthCommitmentResult.Err, commitmentResult.Err, rsResult.Err, proofsResult.Err)
```

I think using https://pkg.go.dev/golang.org/x/sync/errgroup could be very idiomatic here.
I've never used errgroup myself, but the idea is that if one of the computations errors early, your current code will still wait for everything to finish and then return the joint error.
An errgroup would cancel the shared context as soon as one goroutine errors out and return that first error. Unless you really want to know all the errors that happened (and not just the first one), this could be useful to save resources.
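
A toy sketch of the errgroup version (the task functions are hypothetical stand-ins for the PR's workers):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Hypothetical stand-ins for the PR's parallel tasks.
func computeLengthProof(ctx context.Context) error { return nil }
func computeCommitment(ctx context.Context) error  { return nil }

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	// Each task becomes one goroutine. The first error cancels ctx, so
	// ctx-aware siblings can stop early instead of running to completion.
	g.Go(func() error { return computeLengthProof(ctx) })
	g.Go(func() error { return computeCommitment(ctx) })

	// Wait blocks until all goroutines return, and yields only the first
	// non-nil error (unlike multierror.Append, which keeps all of them).
	if err := g.Wait(); err != nil {
		fmt.Println("encoding failed:", err)
	}
}
```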

@bxue-l2 bxue-l2 changed the base branch from master to gpu-encode July 15, 2024 20:03
@bxue-l2 bxue-l2 deleted the branch Layr-Labs:gpu-encode July 16, 2024 03:22
@bxue-l2 bxue-l2 closed this Jul 16, 2024