
working parallelization for different OS's #72

Open
lweisburn opened this issue Dec 18, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@lweisburn
Contributor

Running be_func_parallel (i.e., requesting nproc > 1) uses multiprocessing.Pool, whose default start method depends on the operating system. Only fork works with our code: not spawn, the default on macOS, and possibly not the newer forkserver either.

To fix this, we simply need to specify explicitly that multiprocessing uses fork:

```python
multiprocessing.set_start_method('fork')
```
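Alternatively, a minimal sketch (not code from this repo; `identify` is a made-up placeholder worker) of scoping the start method to a single pool via a context object, which avoids mutating process-wide state. Note that `multiprocessing.get_context("fork")` raises a ValueError on platforms without fork, e.g. Windows:

```python
import multiprocessing
import os

def identify(i):
    # Placeholder worker; in our case this would be the per-fragment calculation.
    return (i, os.getpid())

if __name__ == "__main__":
    # get_context scopes the start method to this pool only, instead of the
    # process-wide multiprocessing.set_start_method('fork').
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(identify, range(8)))
```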

@mscho527
Member

Did #75 fully fix this for now? If so, we could maybe close this issue.

@mcocdawc
Contributor

mcocdawc commented Jan 3, 2025

#75 only made the tests work. The question of multiprocessing across different OSs is not yet entirely solved.

Just my 2 cents:

  • If we want to support more than Linux, we have to move away from fork-based multiprocessing.

  • Even if we stick to Linux and fork-based multiprocessing, I think the code makes some wrong assumptions about how forking works. In this call to apply_async, one argument is copied explicitly, which suggests the author thought the other arguments are not copied; unfortunately, that is not true. Passed arguments are always pickled, sent to the new process, and unpickled, i.e., copied. Only module-level variables are shared, in the fork sense of copy-on-write. At the moment all arguments are copied, so with n processes the memory demand for the arguments grows by a factor of n. If we use multiprocessing, we should pass explicitly shared memory for expensive read-only arguments (see the sketch after this list).

  • Thread-based parallelization is more dangerous, but it works on Windows and seems to work with the limited set of tests so far. See my draft PR here
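For the shared-memory point above, a minimal sketch (not from our codebase; `row_sum` and the array sizes are made up for illustration) of placing one read-only array in multiprocessing.shared_memory so that workers attach to it by name instead of receiving pickled copies:

```python
import numpy as np
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory

def row_sum(shm_name, shape, dtype, i):
    # Re-attach to the shared block by name; this does not copy the data.
    shm = SharedMemory(name=shm_name)
    big = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    result = float(big[i].sum())  # read-only access to the shared array
    shm.close()
    return result

if __name__ == "__main__":
    data = np.random.rand(2000, 2000)
    # One copy into shared memory, instead of one pickled copy per worker.
    shm = SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data

    with Pool(4) as pool:
        args = [(shm.name, data.shape, data.dtype, i) for i in range(10)]
        print(pool.starmap(row_sum, args))

    shm.close()
    shm.unlink()
```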

@lweisburn
Contributor Author

As @mcocdawc says ^^ !

#75 quickly fixed the code so that oneshot BE and regular BE work on multiple cores, as they did before the code was moved here. It also fixed and enabled the test suite for these calculations. It doesn't fix anything for non-Linux systems or explore different parallelization options.

As for the copying: the objects passed through multiprocessing were intentionally not the expensive ones (e.g., passing the ERI file location rather than the full ERIs), precisely because of the copies. The hf_veff matrix and the one-electron integrals are the largest, I think. I don't think we need to worry about memory in this step yet, though we could explore other options in the future? To my understanding, this behavior applies to all standard Python multiprocessing?

@mscho527
Member

mscho527 commented Jan 7, 2025

@mcocdawc Following up on our discussion during the subgroups and on what I've read in the Python docs, I have a question regarding the ThreadPool option.

From what I understand, ThreadPool achieves I/O concurrency by switching context between threads as necessary. Because it is still subject to CPython's Global Interpreter Lock (GIL), one Python interpreter accepts operations from only one thread at a time; i.e., there is no CPU concurrency. In most cases this is not a problem, since I/O wait is often by far the biggest bottleneck.

For our fragment high-level calculations, however, I am not sure these are always I/O-bound. We might need to run some benchmarks, but at the extreme end, I think there are integral-direct CCSD(T) implementations that are reported to be CPU-bound.

Temporarily leaving aside the compatibility issue with non-POSIX systems, I assume we wanted ThreadPool in order to differentiate efficiently between shared and distributed memory. Does using multiprocessing.shared_memory achieve what you need while still using 'fork'?

@mcocdawc
Contributor

mcocdawc commented Jan 9, 2025

What you say about the GIL is true in general, but we are lucky in the case of numerically heavy number crunching: C extensions can explicitly release the GIL, and that is what numpy does, i.e., execution of numpy functions can happen concurrently.
In addition, when writing our own close-to-the-metal functions with numba, we can declare them thread-safe and release the GIL by passing nogil=True to the jit decorator.

In the case of numba we can also compile in parallel with parallel=True and use an explicitly parallel loop, i.e. prange instead of range. This tells the compiler that loop iterations are independent of each other; depending on the context, this allows SIMD instructions and/or multi-threaded execution. (See here)

I tried to find an example that is strongly CPU-bound, namely finding an eigenvalue via the power method. If the matrices are small enough to fit into the cache, we are no longer bound by memory bandwidth but truly by CPU operations; hence I find the dominant eigenvalue of several 3x3 matrices via repeated matrix multiplication.

Imports:

```python
from concurrent.futures import ThreadPoolExecutor

import numba
import numpy as np
from numba import njit, prange
from numpy.linalg import norm
```
Definition of the power method

The get_hermitian_matrices function shows how easy it is to use numba's prange.

```python
@njit(nogil=True, parallel=True)
def get_hermitian_matrices(n_matrices, matrix_size, rng):
    # Draw random matrices and symmetrise each one in a parallel loop.
    A = rng.uniform(low=-1.0, high=1.0, size=(n_matrices, matrix_size, matrix_size))
    for i in prange(n_matrices):
        A[i, :, :] = (A[i, :, :] + A[i, :, :].T) / 2
    return A


@njit(nogil=True)
def normalize(x):
    return x / norm(x)


@njit(nogil=True)
def next_guess(M, x):
    # One power-method step: multiply, renormalise, and report the change.
    x_next = normalize(M @ x)
    return x_next, norm(x - x_next)


def _power_method(M, epsilon=1e-10, start_guess=None, max_iter=10_000):
    if start_guess is None:
        start_guess = np.zeros(len(M))
        start_guess[0] = 1

    x_next, error = next_guess(M, start_guess)
    n_iter = 1
    while error >= epsilon and n_iter <= max_iter:
        x_next, error = next_guess(M, x_next)
        n_iter += 1
    # Elementwise Rayleigh-style estimate of the dominant eigenvalue.
    return ((M @ x_next) / x_next).mean()


# Compile twice, with and without releasing the GIL, for comparison.
power_method = njit(nogil=True)(_power_method)
gil_power_method = njit(nogil=False)(_power_method)
```
Definition of the different parallel or serial executions
```python
@njit
def serial_get_eigenvalues(h_matrices):
    L = h_matrices.shape[0]
    lambdas = np.empty(L, dtype=np.float64)
    for i in range(L):
        lambdas[i] = power_method(h_matrices[i, :, :])
    return lambdas


@njit(parallel=True)
def parallel_get_eigenvalues(h_matrices):
    L = h_matrices.shape[0]
    lambdas = np.empty(L, dtype=np.float64)
    for i in prange(L):
        lambdas[i] = power_method(h_matrices[i, :, :])
    return lambdas


def parallel_python_code(h_matrices, power_method, n_threads=10):
    # Plain Python threads; this scales only if power_method releases the GIL.
    L = h_matrices.shape[0]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [executor.submit(power_method, h_matrices[i, :, :]) for i in range(L)]
        return [future.result() for future in futures]
```

If you time the following executions, you see that the Python code with ThreadPoolExecutor performs as fast as the numba-jitted parallel version if and only if the GIL is released; otherwise it is only as fast as the serial version.

Timings:

```python
# Example setup (sizes are illustrative): many small, cache-friendly matrices.
rng = np.random.default_rng(42)
h_matrices = get_hermitian_matrices(100_000, 3, rng)

n_threads = 10
%time serial_get_eigenvalues(h_matrices)
numba.set_num_threads(n_threads)
%time parallel_get_eigenvalues(h_matrices)
%time parallel_python_code(h_matrices, power_method, n_threads=n_threads)
%time parallel_python_code(h_matrices, gil_power_method, n_threads=n_threads)
```

mscho527 added this to the Future Release milestone Jan 20, 2025