working parallelization for different OS's #72
Comments
Did #75 fully fix this issue for now? If so, we could maybe close this issue.
#75 only made the tests work. The question about multiprocessing and different OSs is not yet entirely solved. Just my 2 cents:
As @mcocdawc says ^^! #75 quickly fixed the code so that oneshot BE and regular BE work on multiple cores, as they did before the code was moved here. It also fixed and enabled the test suite for these calculations. It does not add any fix for non-Linux systems or explore different parallelization options. As for the copying: the objects passed through `multiprocessing` are intentionally not the expensive ones (i.e., we pass the ERI file location, not the full ERIs) precisely because of those copies. The `hf_veff` matrix and the one-electron integrals are the largest objects, I think. I don't think we need to worry about memory in this step yet, though we could explore other options in the future. To my understanding, this copying behavior comes with all standard Python `multiprocessing`?
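A minimal sketch of the copying point (not the repo's actual code; `solve_fragment` and the file names are hypothetical): every argument sent through `multiprocessing` is pickled and copied into each worker process, so a short path string costs almost nothing, while the full ERI array would be duplicated once per worker.

```python
import multiprocessing

import numpy as np


def solve_fragment(eri_path):
    # hypothetical worker: load the (potentially large) ERIs inside the
    # child process instead of shipping the array through pickling
    eris = np.load(eri_path)
    return eris.sum()  # placeholder for the actual fragment calculation


if __name__ == "__main__":
    multiprocessing.set_start_method("fork")  # POSIX-only, per this issue
    # create dummy "ERI" files so the example is self-contained
    paths = []
    for i in range(2):
        path = f"frag{i}_eri.npy"
        np.save(path, np.random.rand(50, 50))
        paths.append(path)
    with multiprocessing.Pool(processes=2) as pool:
        # only the short path strings are pickled, not the arrays
        results = pool.map(solve_fragment, paths)
    print(results)
```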
@mcocdawc Following up based on our discussion during the subgroups and what I've read in the Python docs, I have a question regarding the GIL and our choice of threads versus processes. From what I understand, threads only give a real speedup for I/O-bound tasks, because the GIL allows only one thread to execute Python bytecode at a time. For our fragment high-level calculations, however, I am not sure if these are also always I/O-bound. We might need to run some benchmarks, but at the extreme end, I think there are implementations of integral-direct CCSD(T) reported to be CPU-bound. Temporarily leaving the compatibility issue with non-POSIX systems aside, I assume we wanted `multiprocessing` rather than threads precisely for such CPU-bound work?
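For reference, a minimal benchmark sketch (assumed, not from the thread) of the I/O-bound vs. CPU-bound distinction raised here: a pure-Python CPU-bound loop gains nothing from a thread pool because the GIL serializes the bytecode, while a process pool sidesteps the GIL entirely.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n=2_000_000):
    # pure-Python arithmetic: holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(executor_cls, n_jobs=4):
    start = time.perf_counter()
    with executor_cls(max_workers=n_jobs) as ex:
        list(ex.map(cpu_bound, [2_000_000] * n_jobs))
    return time.perf_counter() - start


if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # ~serial time: GIL-bound
    print("processes:", timed(ProcessPoolExecutor))  # scales with cores
```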
What you say about the GIL is true in general, but we are lucky for the case of numerical-heavy number-crunching: functions compiled with numba's `nogil=True` release the GIL while they run, so ordinary Python threads can execute them truly in parallel. In the case of numba we can also use a parallel compilation with `parallel=True` and `prange`.

I tried to find an example that is very much CPU-bound, namely finding an eigenvalue via the power method. If the matrices are small enough that they fit into the cache, then we are not bound by memory bandwidth anymore, but are truly bound by CPU operations; hence I find the dominant eigenvalue of several 3x3 matrices via repeated matrix multiplication.

Imports

```python
from concurrent.futures import ThreadPoolExecutor

import numba
import numpy as np
from numba import njit, prange
from numpy.linalg import norm
```

Definition of the power method

The `nogil=True` flag lets the compiled functions run with the GIL released; `parallel=True` additionally lets numba parallelize the `prange` loop itself.

```python
@njit(nogil=True, parallel=True)
def get_hermitian_matrices(n_matrices, matrix_size, rng):
    A = rng.uniform(low=-1.0, high=1.0, size=(n_matrices, matrix_size, matrix_size))
    for i in prange(n_matrices):
        # symmetrize, so the eigenvalues are real
        A[i, :, :] = (A[i, :, :] + A[i, :, :].T) / 2
    return A


@njit(nogil=True)
def normalize(x):
    return x / norm(x)


@njit(nogil=True)
def next_guess(M, x):
    x_next = normalize(M @ x)
    return x_next, norm(x - x_next)


def _power_method(M, epsilon=1e-10, start_guess=None, max_iter=10_000):
    if start_guess is None:
        start_guess = np.zeros(len(M))
        start_guess[0] = 1
    x_next, error = next_guess(M, start_guess)
    n_iter = 1
    while error >= epsilon and n_iter <= max_iter:
        x_next, error = next_guess(M, x_next)
        n_iter += 1
    # Rayleigh-style estimate of the dominant eigenvalue
    return ((M @ x_next) / x_next).mean()


# compile the same function twice: once releasing the GIL, once holding it
power_method = njit(nogil=True)(_power_method)
gil_power_method = njit(nogil=False)(_power_method)
```

Definition of the different parallel or serial executions

```python
@njit
def serial_get_eigenvalues(h_matrices):
    L = h_matrices.shape[0]
    lambdas = np.empty(L, dtype=np.float64)
    for i in range(L):
        lambdas[i] = power_method(h_matrices[i, :, :])
    return lambdas


@njit(parallel=True)
def parallel_get_eigenvalues(h_matrices):
    L = h_matrices.shape[0]
    lambdas = np.empty(L, dtype=np.float64)
    for i in prange(L):
        lambdas[i] = power_method(h_matrices[i, :, :])
    return lambdas


def parallel_python_code(h_matrices, power_method, n_threads=10):
    # plain Python threads; this only scales if power_method releases the GIL
    L = h_matrices.shape[0]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        lambdas = [executor.submit(power_method, h_matrices[i, :, :]) for i in range(L)]
        lambdas = [x.result() for x in lambdas]
    return lambdas
```

If you time the following executions, you see that the plain Python thread pool with the `nogil=True`-compiled `power_method` parallelizes about as well as numba's own `prange` version, while the otherwise identical `gil_power_method` (compiled with `nogil=False`) is stuck at serial speed.

Timings

```python
# assumed setup; h_matrices was not defined in the original snippet
rng = np.random.default_rng(0)
h_matrices = get_hermitian_matrices(100_000, 3, rng)

n_threads = 10
%time serial_get_eigenvalues(h_matrices)

numba.set_num_threads(n_threads)
%time parallel_get_eigenvalues(h_matrices)

%time parallel_python_code(h_matrices, power_method, n_threads=n_threads)
%time parallel_python_code(h_matrices, gil_power_method, n_threads=n_threads)
```
Running `be_func_parallel` (i.e. requesting nproc > 1) uses `multiprocessing.Pool`. The pool's start method has a different default depending on the operating system, but only `fork` works with our code (not `spawn`, the default on macOS, and maybe not the newer standard `forkserver`).

To fix this, we simply need to specify explicitly that `multiprocessing` uses `fork`:

```python
multiprocessing.set_start_method('fork')
```
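A sketch of where such a call could live (the function below is a hypothetical placeholder, not the repo's actual code): `multiprocessing.get_context("fork")` scopes the start method to a single pool, which avoids both mutating the global default and the `RuntimeError` that `set_start_method` raises when called twice. Note that `fork` remains POSIX-only, so this does not solve Windows support.

```python
import multiprocessing


def solve_fragment(frag):
    return frag * 2  # placeholder for the actual fragment calculation


def be_func_parallel(fragments, nproc=1):
    # hypothetical wrapper for illustration, not the repo's actual signature
    if nproc == 1:
        return [solve_fragment(frag) for frag in fragments]
    # request 'fork' for this pool only, instead of the process-wide default;
    # raises ValueError on platforms without fork (e.g. Windows)
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=nproc) as pool:
        return pool.map(solve_fragment, fragments)


if __name__ == "__main__":
    print(be_func_parallel(list(range(8)), nproc=4))
```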