
working parallelization for different OS's #72

Open
lweisburn opened this issue Dec 18, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@lweisburn
Contributor

Running be_func_parallel (i.e., requesting nproc > 1) uses multiprocessing.Pool, whose default start method depends on the operating system. Only fork works with our code: not spawn, the default on macOS, and possibly not the newer forkserver either.

To fix this, we simply need to specify explicitly that multiprocessing uses fork:

```python
multiprocessing.set_start_method('fork')
```
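Alternatively, a minimal sketch (not code from this repo; `identify` is a made-up placeholder worker) of scoping the start method to a single pool via a context object, which avoids mutating process-wide state. Note that `multiprocessing.get_context("fork")` raises a ValueError on platforms without fork, e.g. Windows:

```python
import multiprocessing
import os

def identify(i):
    # Placeholder worker; in our case this would be the per-fragment calculation.
    return (i, os.getpid())

if __name__ == "__main__":
    # get_context scopes the start method to this pool only, instead of the
    # process-wide multiprocessing.set_start_method('fork').
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(identify, range(8)))
```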

@mscho527
Member

Did #75 fully fix this for now? If so, we could maybe close this issue.

@mcocdawc
Contributor

mcocdawc commented Jan 3, 2025

#75 only made the tests work. The question of multiprocessing across different OSs is not yet entirely solved.

Just my 2 cents:

  • If we want to support more than Linux, we have to move away from fork-based multiprocessing.

  • Even if we stick to Linux and fork-based multiprocessing, I think the code makes some wrong assumptions about how forking works. In this call to apply_async, one argument is copied explicitly, which suggests the author thought the other arguments are not copied; unfortunately, that is not true. Passed arguments are always pickled, sent to the new process, and unpickled, i.e., copied. Only module-level variables are shared, in the fork sense of copy-on-write. At the moment all arguments are copied, so with n processes the memory demand for the arguments grows by a factor of n. If we use multiprocessing, we should pass explicitly shared memory for expensive read-only arguments (see the sketch after this list).

  • Thread-based parallelization is more dangerous, but it works on Windows and seems to work with the limited set of tests so far. See my draft PR here
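For the shared-memory point above, a minimal sketch (not from our codebase; `row_sum` and the array sizes are made up for illustration) of placing one read-only array in multiprocessing.shared_memory so that workers attach to it by name instead of receiving pickled copies:

```python
import numpy as np
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory

def row_sum(shm_name, shape, dtype, i):
    # Re-attach to the shared block by name; this does not copy the data.
    shm = SharedMemory(name=shm_name)
    big = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    result = float(big[i].sum())  # read-only access to the shared array
    shm.close()
    return result

if __name__ == "__main__":
    data = np.random.rand(2000, 2000)
    # One copy into shared memory, instead of one pickled copy per worker.
    shm = SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data

    with Pool(4) as pool:
        args = [(shm.name, data.shape, data.dtype, i) for i in range(10)]
        print(pool.starmap(row_sum, args))

    shm.close()
    shm.unlink()
```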

@lweisburn
Contributor Author

As @mcocdawc says ^^ !

#75 quickly fixed the code so that oneshot BE and regular BE work on multiple cores, as they did before the code was moved here. It also fixed and enabled the test suite for these calculations. It doesn't fix anything for non-Linux systems or explore different parallelization options.

As for the copying: the objects passed through multiprocessing were intentionally not the expensive ones (e.g., passing the ERI file location rather than the full ERIs), precisely because of the copies. The hf_veff matrix and the one-electron integrals are the largest, I think. I don't think we need to worry about memory in this step yet, though we could explore other options in the future? To my understanding, this behavior applies to all standard Python multiprocessing?

@mscho527
Member

mscho527 commented Jan 7, 2025

@mcocdawc Following up on our discussion during the subgroups and on what I've read in the Python docs, I have a question regarding the ThreadPool option.

From what I understand, ThreadPool achieves I/O concurrency by switching context between threads as necessary. Because it is still subject to CPython's Global Interpreter Lock (GIL), one Python interpreter accepts operations from only one thread at a time; i.e., there is no CPU concurrency. In most cases this is not a problem, since I/O wait is often by far the biggest bottleneck.

For our fragment high-level calculations, however, I am not sure these are always I/O-bound. We might need to run some benchmarks, but at the extreme end, I think there are integral-direct CCSD(T) implementations that are reported to be CPU-bound.

Temporarily leaving aside the compatibility issue with non-POSIX systems, I assume we wanted ThreadPool in order to differentiate efficiently between shared and distributed memory. Does using multiprocessing.shared_memory achieve what you need while still using 'fork'?

@mcocdawc
Contributor

mcocdawc commented Jan 9, 2025

What you say about the GIL is true in general, but we are lucky in the case of numerically heavy number crunching: C extensions can explicitly release the GIL, and that is what numpy does, i.e., execution of numpy functions can happen concurrently.
In addition, when writing our own close-to-the-metal functions with numba, we can declare them thread-safe and release the GIL by passing nogil=True to the jit decorator.

In the case of numba we can also compile in parallel with parallel=True and use an explicitly parallel loop, i.e. prange instead of range. This tells the compiler that loop iterations are independent of each other; depending on the context, this allows SIMD instructions and/or multi-threaded execution. (See here)

I tried to find an example that is strongly CPU-bound, namely finding an eigenvalue via the power method. If the matrices are small enough to fit into the cache, we are no longer bound by memory bandwidth but truly by CPU operations; hence I find the dominant eigenvalue of several 3x3 matrices via repeated matrix multiplication.

Imports:

```python
from concurrent.futures import ThreadPoolExecutor

import numba
import numpy as np
from numba import njit, prange
from numpy.linalg import norm
```
Definition of the power method

The get_hermitian_matrices function shows how easy it is to use numba's prange.

```python
@njit(nogil=True, parallel=True)
def get_hermitian_matrices(n_matrices, matrix_size, rng):
    # Draw random matrices and symmetrise each one in a parallel loop.
    A = rng.uniform(low=-1.0, high=1.0, size=(n_matrices, matrix_size, matrix_size))
    for i in prange(n_matrices):
        A[i, :, :] = (A[i, :, :] + A[i, :, :].T) / 2
    return A


@njit(nogil=True)
def normalize(x):
    return x / norm(x)


@njit(nogil=True)
def next_guess(M, x):
    # One power-method step: multiply, renormalise, and report the change.
    x_next = normalize(M @ x)
    return x_next, norm(x - x_next)


def _power_method(M, epsilon=1e-10, start_guess=None, max_iter=10_000):
    if start_guess is None:
        start_guess = np.zeros(len(M))
        start_guess[0] = 1

    x_next, error = next_guess(M, start_guess)
    n_iter = 1
    while error >= epsilon and n_iter <= max_iter:
        x_next, error = next_guess(M, x_next)
        n_iter += 1
    # Elementwise Rayleigh-style estimate of the dominant eigenvalue.
    return ((M @ x_next) / x_next).mean()


# Compile twice, with and without releasing the GIL, for comparison.
power_method = njit(nogil=True)(_power_method)
gil_power_method = njit(nogil=False)(_power_method)
```
Definition of the different parallel or serial executions
```python
@njit
def serial_get_eigenvalues(h_matrices):
    L = h_matrices.shape[0]
    lambdas = np.empty(L, dtype=np.float64)
    for i in range(L):
        lambdas[i] = power_method(h_matrices[i, :, :])
    return lambdas


@njit(parallel=True)
def parallel_get_eigenvalues(h_matrices):
    L = h_matrices.shape[0]
    lambdas = np.empty(L, dtype=np.float64)
    for i in prange(L):
        lambdas[i] = power_method(h_matrices[i, :, :])
    return lambdas


def parallel_python_code(h_matrices, power_method, n_threads=10):
    # Plain Python threads; this scales only if power_method releases the GIL.
    L = h_matrices.shape[0]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [executor.submit(power_method, h_matrices[i, :, :]) for i in range(L)]
        return [future.result() for future in futures]
```

If you time the following executions, you see that the Python code with ThreadPoolExecutor performs as fast as the numba-jitted parallel version if and only if the GIL is released; otherwise it is only as fast as the serial version.

Timings:

```python
# Example setup (sizes are illustrative): many small, cache-friendly matrices.
rng = np.random.default_rng(42)
h_matrices = get_hermitian_matrices(100_000, 3, rng)

n_threads = 10
%time serial_get_eigenvalues(h_matrices)
numba.set_num_threads(n_threads)
%time parallel_get_eigenvalues(h_matrices)
%time parallel_python_code(h_matrices, power_method, n_threads=n_threads)
%time parallel_python_code(h_matrices, gil_power_method, n_threads=n_threads)
```

mscho527 added this to the Future Release milestone Jan 20, 2025