optimize encoder interpolation latency #309

Merged
13 commits merged into Layr-Labs:master on Mar 22, 2024

Conversation

bxue-l2
Contributor

@bxue-l2 bxue-l2 commented Mar 3, 2024

Why are these changes needed?

This PR optimizes encoder interpolation latency by:

  • Parallelizing the FFT computation in RS encoding
  • Removing the computation of inverse powers of RootOfUnity by looking up the precomputed array backward (see the sketch below)

Previously, encoding 2 MB of data took 2-3 seconds. After this change, the computation completes in about 300 ms.
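For context, the backward-lookup trick rests on the identity (w^k)^{-1} = w^(n-k) for a primitive n-th root of unity w (since w^n = 1). The following is a minimal toy sketch of that identity in Go; the field, values, and names are illustrative only and are not taken from the encoder code.

package main

import "fmt"

// Toy illustration of the backward-lookup trick: a single forward table
// roots[j] = w^j also serves as the inverse-power table when indexed
// backward, avoiding field inversions or a second precomputed table.
func main() {
	const p = 17 // toy prime field F_17
	const n = 4
	const w = 13 // 13 is a primitive 4th root of unity mod 17 (13^4 mod 17 == 1)

	// Forward table: roots[j] = w^j mod p for j in [0, n).
	roots := make([]uint64, n)
	roots[0] = 1
	for j := 1; j < n; j++ {
		roots[j] = roots[j-1] * w % p
	}

	for k := uint64(0); k < n; k++ {
		inv := roots[(n-k)%n]               // backward lookup, no inversion
		fmt.Println(k, roots[k]*inv%p == 1) // (w^k) * (w^(n-k)) == 1
	}
}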

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@bxue-l2 bxue-l2 requested review from mooselumph, jianoaix and dmanc March 3, 2024 21:35
@bxue-l2 bxue-l2 marked this pull request as ready for review March 3, 2024 22:18
tmp.Mul(&wPow, &w)

wPow.Set(&tmp)
// We cam lookup the inverse power by counting RootOfUnity backward
Contributor

typo: cam -> can

Contributor

@jianoaix jianoaix left a comment

How was the perf measured?

ys := polyEvals[g.ChunkLength*i : g.ChunkLength*(i+1)]
err := rb.ReverseBitOrderFr(ys)
if err != nil {
results <- err
Collaborator

should have continue here

Contributor Author

Good catch. I used "return" because if anything has an error, the entire MakeFrames fails.

Collaborator

I think it's better to use continue. If you had enough errors for all of the workers to return, the program could hang on L104.

Continuing ensures that the workers continue to consume the requests.

Obviously, you could optimize this further if an error was likely here, but this would be a programming error so I don't think it needs to be optimized.
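As an aside, a minimal self-contained sketch of the pattern under discussion (names are illustrative, not the actual encoder code): each worker reports per-job errors on the results channel and keeps draining jobChan, then sends a single terminal nil once the channel is closed. Returning early instead could leave the producer blocked on jobChan once every worker had exited.

package main

import (
	"errors"
	"fmt"
)

// process stands in for the per-chunk work; it fails on a few jobs to
// exercise the error path.
func process(job int) error {
	if job%7 == 0 {
		return errors.New("simulated per-job failure")
	}
	return nil
}

// worker reports per-job errors but keeps consuming jobs, then signals
// completion with a single nil.
func worker(jobChan <-chan int, results chan<- error) {
	for job := range jobChan {
		if err := process(job); err != nil {
			results <- err
			continue // keep draining jobChan so the producer never blocks
		}
	}
	results <- nil // one terminal result per worker
}

func main() {
	const numJobs, numWorkers = 20, 4
	jobChan := make(chan int)
	// Buffer large enough that workers never block on sends.
	results := make(chan error, numJobs+numWorkers)

	for w := 0; w < numWorkers; w++ {
		go worker(jobChan, results)
	}
	for j := 0; j < numJobs; j++ {
		jobChan <- j // if workers returned on error, this send could hang
	}
	close(jobChan)

	// Drain until every worker has sent its terminal nil.
	for done := 0; done < numWorkers; {
		if err := <-results; err != nil {
			fmt.Println("worker reported:", err)
		} else {
			done++
		}
	}
}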

Contributor Author

I see. good call.

}
coeffs, err := g.GetInterpolationPolyCoeff(ys, uint32(j))
if err != nil {
results <- err
Collaborator

also here

Contributor Author

same as above

@bxue-l2
Contributor Author

bxue-l2 commented Mar 5, 2024

How was the perf measured?

It was tested in preprod by examining the encoder logs; see the link.


func (g *Encoder) interpolyWorker(
polyEvals []fr.Element,
jobChan <-chan JobRequest,
Contributor

Can it just use a workerpool, instead of a dedicated channel to create job requests?

Contributor Author

It is actually a good idea. I think we should create a global persistent object under the prover? What do you think?

Contributor

Why global/persistent? Is the concern the workerpool creation/destroy cost? It's generally better not to have a global/persistent object if it can be avoided.

Contributor Author

Say each request is 256 KB; then 10 MB/s corresponds to about 40 parallel requests, and they need to share threads. If multiple pool objects are created, it is unclear how to allocate threads among them. I will mark it as a todo, since I don't know the performance of a threadpool compared to the current approach.
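For reference, a hedged sketch of the worker-pool alternative being suggested, using golang.org/x/sync/errgroup as a stand-in for a pool (the identifiers such as interpolateChunk are hypothetical and not the actual eigenda code): concurrency is capped at the worker count and the first error fails the whole batch.

package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// interpolateChunk is a hypothetical placeholder for the per-chunk work
// (bit-reversal plus interpolation in the real encoder).
func interpolateChunk(i int) error {
	return nil
}

// makeFramesWithPool shows the pooled alternative: one task per chunk, with
// at most numWorkers tasks running concurrently.
func makeFramesWithPool(numChunks, numWorkers int) error {
	var g errgroup.Group
	g.SetLimit(numWorkers) // bound concurrency instead of hand-rolled channels

	for i := 0; i < numChunks; i++ {
		i := i // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			return interpolateChunk(i)
		})
	}
	return g.Wait() // first non-nil error, if any
}

func main() {
	fmt.Println(makeFramesWithPool(8, 4))
}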

@bxue-l2 bxue-l2 force-pushed the parallel-rs-fft-cache-rou-array branch from a130de0 to 2ee68eb Compare March 20, 2024 18:16
@@ -225,6 +225,7 @@ func (p *ParametrizedProver) proofWorker(
points: nil,
err: err,
}
continue
Contributor

How about just moving the for loop below inside this code block?

Contributor Author

I used an else statement.

@@ -56,8 +57,8 @@ func (g *Encoder) Encode(inputFr []fr.Element) (*GlobalPoly, []Frame, []uint32,
return nil, nil, nil, err
}

log.Printf(" SUMMARY: Encode %v byte among %v numNode takes %v\n",
len(inputFr)*encoding.BYTES_PER_COEFFICIENT, g.NumChunks, time.Since(start))
log.Printf(" SUMMARY: RSEncode %v byte among %v numNode with chunkSize %v takes %v\n",
Contributor

nit: I think we settled on the terminology convention that "size" means number of bytes and "length" means number of symbols

@@ -77,29 +78,45 @@ func (g *Encoder) MakeFrames(
indices := make([]uint32, 0)
frames := make([]Frame, g.NumChunks)

for i := uint64(0); i < uint64(g.NumChunks); i++ {
numWorker := uint64(g.NumRSWorker)
Contributor

No need to convert to uint64, since L83 already makes the conversion.
Also, it probably doesn't need that many workers.

Contributor Author

it is assigned to the number of chunks at L84

frames,
)
}

Contributor

nit: moving the k defined on L76 down here could make it more readable

frames[k].Coeffs = coeffs
}

results <- nil
Contributor

should this be inside the for loop? It would then be one result per JobRequest

Contributor Author

there are "numWorker" threads and "NumChunks" jobs. If it were moved inside the for loop, the receiving side would see one result per job rather than one per worker, and the workers would be cut off prematurely.
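To complement the worker sketch earlier in the thread, here is a hedged sketch of what the consumer side might look like under this design, assuming the results channel carries per-job errors plus one terminal nil per worker (illustrative names, not the actual MakeFrames code).

// drainResults waits until every worker has reported its terminal nil,
// remembering the first per-job error seen along the way.
func drainResults(results <-chan error, numWorker int) error {
	var firstErr error
	for finished := 0; finished < numWorker; {
		if err := <-results; err != nil {
			firstErr = err // remember the failure but keep draining
		} else {
			finished++ // a worker sent its terminal nil
		}
	}
	return firstErr // non-nil fails the whole encoding call
}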

}
jobChan <- jr
k++
Contributor

is k needed? it's the same as i

for i := uint64(0); i < g.NumChunks; i++ {
j := rb.ReverseBitsLimited(uint32(g.NumChunks), uint32(i))
jr := JobRequest{
Index: uint64(i),
Contributor

i is already uint64

@bxue-l2 bxue-l2 merged commit 992e0f2 into Layr-Labs:master Mar 22, 2024
5 checks passed