optimize encoder interpolation latency #309

Merged
13 commits merged into Layr-Labs:master on Mar 22, 2024

Conversation

bxue-l2
Contributor

@bxue-l2 bxue-l2 commented Mar 3, 2024

Why are these changes needed?

This PR optimizes encoder interpolation latency by:

  • Parallelizing the FFT computation in RS encoding
  • Removing the computation of inverse powers of RootOfUnity by looking up the precomputed array backward (see the sketch below)

Previously, encoding 2 MB of data took 2-3 seconds. After this change, the computation completes in about 300 ms.
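For context, the backward-lookup trick rests on the identity (w^k)^{-1} = w^(n-k) for a primitive n-th root of unity w (since w^n = 1). The following is a minimal toy sketch of that identity in Go; the field, values, and names are illustrative only and are not taken from the encoder code.

package main

import "fmt"

// Toy illustration of the backward-lookup trick: a single forward table
// roots[j] = w^j also serves as the inverse-power table when indexed
// backward, avoiding field inversions or a second precomputed table.
func main() {
	const p = 17 // toy prime field F_17
	const n = 4
	const w = 13 // 13 is a primitive 4th root of unity mod 17 (13^4 mod 17 == 1)

	// Forward table: roots[j] = w^j mod p for j in [0, n).
	roots := make([]uint64, n)
	roots[0] = 1
	for j := 1; j < n; j++ {
		roots[j] = roots[j-1] * w % p
	}

	for k := uint64(0); k < n; k++ {
		inv := roots[(n-k)%n]               // backward lookup, no inversion
		fmt.Println(k, roots[k]*inv%p == 1) // (w^k) * (w^(n-k)) == 1
	}
}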

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@bxue-l2 bxue-l2 requested review from mooselumph, jianoaix and dmanc March 3, 2024 21:35
@bxue-l2 bxue-l2 marked this pull request as ready for review March 3, 2024 22:18
tmp.Mul(&wPow, &w)

wPow.Set(&tmp)
// We cam lookup the inverse power by counting RootOfUnity backward
Contributor

typo: cam -> can

Contributor

@jianoaix jianoaix left a comment

How was the perf measured?

ys := polyEvals[g.ChunkLength*i : g.ChunkLength*(i+1)]
err := rb.ReverseBitOrderFr(ys)
if err != nil {
results <- err
Collaborator

should have continue here

Contributor Author

Good catch. I used "return" because if anything has an error, the entire MakeFrames fails.

Collaborator

I think it's better to use continue. If you had enough errors for all of the workers to return, the program could hang on L104.

Continuing ensures that the workers continue to consume the requests.

Obviously, you could optimize this further if an error was likely here, but this would be a programming error so I don't think it needs to be optimized.
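As an aside, a minimal self-contained sketch of the pattern under discussion (names are illustrative, not the actual encoder code): each worker reports per-job errors on the results channel and keeps draining jobChan, then sends a single terminal nil once the channel is closed. Returning early instead could leave the producer blocked on jobChan once every worker had exited.

package main

import (
	"errors"
	"fmt"
)

// process stands in for the per-chunk work; it fails on a few jobs to
// exercise the error path.
func process(job int) error {
	if job%7 == 0 {
		return errors.New("simulated per-job failure")
	}
	return nil
}

// worker reports per-job errors but keeps consuming jobs, then signals
// completion with a single nil.
func worker(jobChan <-chan int, results chan<- error) {
	for job := range jobChan {
		if err := process(job); err != nil {
			results <- err
			continue // keep draining jobChan so the producer never blocks
		}
	}
	results <- nil // one terminal result per worker
}

func main() {
	const numJobs, numWorkers = 20, 4
	jobChan := make(chan int)
	// Buffer large enough that workers never block on sends.
	results := make(chan error, numJobs+numWorkers)

	for w := 0; w < numWorkers; w++ {
		go worker(jobChan, results)
	}
	for j := 0; j < numJobs; j++ {
		jobChan <- j // if workers returned on error, this send could hang
	}
	close(jobChan)

	// Drain until every worker has sent its terminal nil.
	for done := 0; done < numWorkers; {
		if err := <-results; err != nil {
			fmt.Println("worker reported:", err)
		} else {
			done++
		}
	}
}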

Contributor Author

I see. good call.

}
coeffs, err := g.GetInterpolationPolyCoeff(ys, uint32(j))
if err != nil {
results <- err
Collaborator

also here

Contributor Author

same as above

@bxue-l2
Contributor Author

bxue-l2 commented Mar 5, 2024

How was the perf measured?

It was tested in preprod by examining the encoder logs; see the link.


func (g *Encoder) interpolyWorker(
polyEvals []fr.Element,
jobChan <-chan JobRequest,
Contributor

Can it just use a workerpool, instead of a dedicated channel to create job requests?

Contributor Author

It is actually a good idea. I think we should create a global persistent object under the prover? What do you think?

Contributor

Why global/persistent? Is the concern the workerpool creation/destroy cost? It's generally better not to have a global/persistent object if it can be avoided.

Contributor Author

Say each request is 256 KB; then 10 MB/s corresponds to about 40 parallel requests, and they need to share threads. If multiple pool objects are created, it is unclear how to allocate threads among them. I will mark it as a todo, since I don't know the performance of a threadpool compared to the current approach.
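For reference, a hedged sketch of the worker-pool alternative being suggested, using golang.org/x/sync/errgroup as a stand-in for a pool (the identifiers such as interpolateChunk are hypothetical and not the actual eigenda code): concurrency is capped at the worker count and the first error fails the whole batch.

package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// interpolateChunk is a hypothetical placeholder for the per-chunk work
// (bit-reversal plus interpolation in the real encoder).
func interpolateChunk(i int) error {
	return nil
}

// makeFramesWithPool shows the pooled alternative: one task per chunk, with
// at most numWorkers tasks running concurrently.
func makeFramesWithPool(numChunks, numWorkers int) error {
	var g errgroup.Group
	g.SetLimit(numWorkers) // bound concurrency instead of hand-rolled channels

	for i := 0; i < numChunks; i++ {
		i := i // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			return interpolateChunk(i)
		})
	}
	return g.Wait() // first non-nil error, if any
}

func main() {
	fmt.Println(makeFramesWithPool(8, 4))
}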

@bxue-l2 bxue-l2 force-pushed the parallel-rs-fft-cache-rou-array branch from a130de0 to 2ee68eb Compare March 20, 2024 18:16
@@ -225,6 +225,7 @@ func (p *ParametrizedProver) proofWorker(
points: nil,
err: err,
}
continue
Contributor

How about just moving the for loop below inside this code block?

Contributor Author

I used an else statement.

@@ -56,8 +57,8 @@ func (g *Encoder) Encode(inputFr []fr.Element) (*GlobalPoly, []Frame, []uint32,
return nil, nil, nil, err
}

log.Printf(" SUMMARY: Encode %v byte among %v numNode takes %v\n",
len(inputFr)*encoding.BYTES_PER_COEFFICIENT, g.NumChunks, time.Since(start))
log.Printf(" SUMMARY: RSEncode %v byte among %v numNode with chunkSize %v takes %v\n",
Contributor

nit: I think we settled on the terminology convention that "size" means number of bytes and "length" means number of symbols

@@ -77,29 +78,45 @@ func (g *Encoder) MakeFrames(
indices := make([]uint32, 0)
frames := make([]Frame, g.NumChunks)

for i := uint64(0); i < uint64(g.NumChunks); i++ {
numWorker := uint64(g.NumRSWorker)
Contributor

No need to convert to uint64, since L83 already makes the conversion.
Also, it probably doesn't need that many workers.

Contributor Author

it is assigned to the number of chunks at L84

frames,
)
}

Contributor

nit: moving the k defined on L76 down here could make it more readable

frames[k].Coeffs = coeffs
}

results <- nil
Contributor

should this be inside the for loop? It would then be one result per JobRequest

Contributor Author

there are "numWorker" threads and "NumChunks" jobs. If it were moved inside the for loop, the receiving side would see one result per job rather than one per worker, and the workers would be cut off prematurely.
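To complement the worker sketch earlier in the thread, here is a hedged sketch of what the consumer side might look like under this design, assuming the results channel carries per-job errors plus one terminal nil per worker (illustrative names, not the actual MakeFrames code).

// drainResults waits until every worker has reported its terminal nil,
// remembering the first per-job error seen along the way.
func drainResults(results <-chan error, numWorker int) error {
	var firstErr error
	for finished := 0; finished < numWorker; {
		if err := <-results; err != nil {
			firstErr = err // remember the failure but keep draining
		} else {
			finished++ // a worker sent its terminal nil
		}
	}
	return firstErr // non-nil fails the whole encoding call
}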

}
jobChan <- jr
k++
Contributor

is k needed? it's the same as i

for i := uint64(0); i < g.NumChunks; i++ {
j := rb.ReverseBitsLimited(uint32(g.NumChunks), uint32(i))
jr := JobRequest{
Index: uint64(i),
Contributor

i is already uint64

@bxue-l2 bxue-l2 merged commit 992e0f2 into Layr-Labs:master Mar 22, 2024
5 checks passed