merge Fourier method into main #23
Conversation
The macos-14 runner image doesn't support Python 3.9 anymore.
```diff
@@ -29,7 +29,7 @@ lto = true
 codegen-units = 1

 [dependencies]
-pyo3 = { version = "0.20", features = ["abi3-py38", "extension-module"] }
+pyo3 = { version = "0.20", features = ["abi3-py310", "extension-module"] }
```
It is not trivial due to the new `Bound` API, but you may want to consider upgrading to pyo3 0.22 as it significantly reduces Python-to-Rust call overheads.
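For context, a hedged sketch of the kind of change the `Bound` migration involves (the function is illustrative, not from this crate):

```rust
use pyo3::prelude::*;

// Under the 0.21+ Bound API, GIL-refs like `&PyAny` become
// `Bound<'py, PyAny>`, carrying the GIL lifetime explicitly.
#[pyfunction]
fn passthrough<'py>(obj: Bound<'py, PyAny>) -> Bound<'py, PyAny> {
    obj
}
```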
Hi Adam, it's great to hear from you! Yeah, I realized that updating the version created a few problems, and I wanted to publish the new function as quickly as possible. But now I should have some time to look into upgrading pyo3.
The benchmarks are inconclusive, but there might actually be a slight improvement 👍
```rust
rayon::ThreadPoolBuilder::new()
    .num_threads(num_threads.unwrap_or(rayon::current_num_threads()))
    .build()
```
This will add a significant cost for initializing a new thread pool on each and every call, i.e. the thread start-up cost. Maybe this could be changed to use the global/default thread pool if `num_threads` is `None`? Especially as `rayon::current_num_threads` is not the default (that would be `RAYON_NUM_THREADS` or `std::thread::available_parallelism` otherwise), but the number of threads in the global thread pool, in contrast to the function documentation. (So basically it will use the number of threads chosen when the global pool was initialized, which of course will be the default if it was initialized implicitly/automatically.)
Maybe something like

```rust
use rayon::ThreadPoolBuilder;

// Build a dedicated pool only when an explicit thread count is requested;
// otherwise run on the global pool.
fn with_num_threads<F: FnOnce() -> R + Send, R: Send>(num_threads: Option<usize>, f: F) -> R {
    if let Some(num_threads) = num_threads {
        ThreadPoolBuilder::new().num_threads(num_threads).build().unwrap().install(f)
    } else {
        f()
    }
}
```
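For illustration, a hypothetical call site (the closure body is a stand-in, not code from this PR):

```rust
use rayon::prelude::*;

// The parallel sum here is a placeholder for the actual kernel.
let total: f64 = with_num_threads(num_threads, || {
    (0..1_000_000).into_par_iter().map(|i| i as f64).sum()
});
```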
Thanks for pointing that out and thanks for the suggestion. I'll take a closer look soon.
```rust
    .and(spectrum_factor)
    .and(z1)
    .and(z2)
    .fold(0.0, |sum, k, &spectrum_factor, &z1, &z2| {
```
Since the accumulator is a scalar and hence cheap to copy/reduce, I wonder whether this would actually benefit from using a `par_fold` instead of the sequential fold?
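For reference, a hedged sketch of what that might look like with ndarray's `Zip::par_fold` (assuming the crate's `rayon` feature; the chain head is cut off in the diff, so `Zip::from(&k)` and the kernel body are placeholders):

```rust
use ndarray::Zip;

// `par_fold` takes an identity closure, a per-element fold, and an
// associative reduce; the `+` kernel stands in for the real computation.
let sum = Zip::from(&k)
    .and(&spectrum_factor)
    .and(&z1)
    .and(&z2)
    .par_fold(
        || 0.0,
        |sum, &k, &spectrum_factor, &z1, &z2| sum + k * spectrum_factor * z1 * z2,
        |a, b| a + b,
    );
```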
And thanks again :-) I'll run a few benchmarks soon to see if you are right.
Interesting, the `par_fold` version is about 50% slower in my benchmarks.
Quite possible that the overhead is unhelpful for the inner loop, which would otherwise benefit from auto-vectorization. Using `with_min_len` might be a nice compromise between granularity and overhead, but it does add another tunable.
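A hedged sketch of the idea (the data and the chunk size of 4096 are illustrative, not tuned for this crate):

```rust
use rayon::prelude::*;

// `with_min_len` stops rayon from splitting the work below 4096 elements,
// keeping each task large enough for the auto-vectorized inner loop.
let data = vec![1.0f64; 1 << 20];
let sum: f64 = data
    .par_iter()
    .with_min_len(4096)
    .map(|&x| x * x) // placeholder kernel
    .sum();
```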
Oh man, the weeks before my holidays were pretty stressful and I had to constantly jump between different things. But still, creating the tag from this branch?! Thanks for catching this!
Since the 1.0 tag was created from this branch, it should be merged into main.