Python JAX/Numpy implementation of EGSnrc #658
14 comments · 64 replies
-
Note that in the C++ code above, using plain arrays (as in …
-
When it comes time for testing, I have a cluster we can test with: 6 compute nodes and a master node, currently running OpenPBS; the next incarnation will run Torque. I'm thinking it might be cleanest to create a new EGSnrc-Testing repository with folders for each language. Root:
We could start with this today. Now that I think about it, adding some testing to this infrastructure right now would be a big help as I continue to test Clang/Flang against GCC versions. It might make it easier for others to share their testing tools too. I think it would make sense to include performance tests like the one above, to keep track of tooling changes.
-
This is quite exciting! 🙂 🎉 @ftessier, did you still want me to build out a lookup-table-based adjustment approach in JAX to test out its timing? I'd be happy to do that. Does EGSnrc interpolate between the table entries? If so, is this just done bilinearly? (Or n-linearly, depending on how many dimensions the table has...)
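For reference, a jit-friendly table lookup in JAX can be as simple as the sketch below (illustrative only: it assumes 1-D piecewise-linear interpolation over placeholder table values, and says nothing about how EGSnrc actually stores or interpolates its tables):

```python
import jax
import jax.numpy as jnp

@jax.jit
def table_lookup(x, table_x, table_y):
    # jnp.interp does 1-D piecewise-linear interpolation over a sorted grid;
    # a bilinear (2-D) table would compose one such lookup per axis.
    return jnp.interp(x, table_x, table_y)

table_x = jnp.linspace(0.01, 1.0, 100)  # e.g. photon energies (MeV)
table_y = jnp.log(table_x)              # placeholder tabulated values
print(table_lookup(jnp.array([0.015, 0.3, 0.9]), table_x, table_y))
```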
-
So indeed, when I modify the python code to do what the C++ code does, using NumPy (no JAX) but updating particles one at a time (yet still generating all random numbers at "once" with …
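Presumably the pattern in question looks something like the following sketch (my reconstruction of the described approach, not the exact code from that comment; array names follow the C++ snippet below):

```python
import time
import numpy as np

NUM_PARTICLES = 1_000_000
ITERATIONS = 10

rng = np.random.default_rng(0)
x, y, z, u, v, w, E = (np.zeros(NUM_PARTICLES) for _ in range(7))

start = time.time()
for _ in range(ITERATIONS):
    # Generate all the random numbers for this iteration in one call...
    random_vector = rng.standard_normal(7 * NUM_PARTICLES)
    k = 0
    # ...but still update the particles one at a time, as the C++ loop does.
    for n in range(NUM_PARTICLES):
        x[n] += random_vector[k]; k += 1
        y[n] += random_vector[k]; k += 1
        z[n] += random_vector[k]; k += 1
        u[n] += random_vector[k]; k += 1
        v[n] += random_vector[k]; k += 1
        w[n] += random_vector[k]; k += 1
        E[n] += random_vector[k]; k += 1
print("duration: {:.3f} ms".format((time.time() - start) * 1000.0))
```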
-
To be fair to the C++ code, I should have generated the random numbers outside the loop, as in:

```cpp
std::vector<double> random_vector(7*NUM_PARTICLES);
std::generate(begin(random_vector), end(random_vector), bind(sample, generator));

// update particle arrays
for (int i=0; i<ITERATIONS; i++) {
    int k = 0;
    for (int n=0; n<NUM_PARTICLES; n++) {
        x[n] += random_vector[k++];
        y[n] += random_vector[k++];
        z[n] += random_vector[k++];
        u[n] += random_vector[k++];
        v[n] += random_vector[k++];
        w[n] += random_vector[k++];
        E[n] += random_vector[k++];
    }
}
```

which further improves the performance about threefold (then one could also write vector operations for the update; not sure what the optimizer does with it).
-
So, I've made a Particles dictionary that is amenable to JAX jitting. On Google's Colaboratory the timings are much the same as before:

```
# CPU:
# random_walk duration: 5368.768 ms
# random_walk duration: 1387.357 ms
# random_walk duration: 1339.349 ms
# random_walk duration: 1359.297 ms
# random_walk duration: 1344.687 ms
# random_walk duration: 1326.127 ms
# random_walk duration: 1358.951 ms
# random_walk duration: 1382.427 ms
# random_walk duration: 1355.318 ms
# random_walk duration: 1377.143 ms

# GPU:
# random_walk duration: 1373.592 ms
# random_walk duration: 15.000 ms
# random_walk duration: 14.722 ms
# random_walk duration: 14.642 ms
# random_walk duration: 21.203 ms
# random_walk duration: 14.714 ms
# random_walk duration: 14.782 ms
# random_walk duration: 14.749 ms
# random_walk duration: 14.655 ms
# random_walk duration: 14.642 ms
```

This uses the same timing approach as @ftessier used at the beginning of this discussion, for consistency. I've copied in the script for the timing tests below, but it is also available at https://github.com/SimonBiggs/egsnrc2py/blob/787eef7e28dbad2d61ef1da90343413763295259/prototyping/mypy_based_particles.py#L1-L82

```python
import time
from typing import Dict, Tuple

from typing_extensions import Literal

import matplotlib.pyplot as plt
from jax import jit, random
import jax.numpy as jnp

ParticleKeys = Literal["position", "direction", "energy"]
Particles = Dict[ParticleKeys, jnp.DeviceArray]


def random_walk(
    prng_key: jnp.DeviceArray, particles: Particles, iterations: int,
) -> Tuple[jnp.DeviceArray, Particles]:
    num_particles = particles["position"].shape[-1]

    for _ in range(iterations):
        random_normal_numbers = random.normal(prng_key, shape=(7, num_particles))
        (prng_key,) = random.split(prng_key, 1)

        particles["position"] += random_normal_numbers[0:3, :]
        particles["direction"] += random_normal_numbers[3:6, :]
        # Row 6 is the seventh and last row; the 6:7 slice keeps the
        # (1, num_particles) shape of the "energy" array.
        particles["energy"] += random_normal_numbers[6:7, :]

    return prng_key, particles


random_walk = jit(random_walk, static_argnums=(2,))


def timer(func):
    def wrap(*args, **kwargs):
        start = time.time()
        ret = func(*args, **kwargs)

        # See https://jax.readthedocs.io/en/latest/async_dispatch.html
        # for why this is needed.
        _, particles = ret
        for _, item in particles.items():
            item.block_until_ready()

        stop = time.time()
        duration = (stop - start) * 1000.0
        print("{:s} duration: {:.3f} ms".format(func.__name__, duration))
        return ret

    return wrap


random_walk = timer(random_walk)


def particles_zeros(num_particles: int) -> Particles:
    particles: Particles = {
        "position": jnp.zeros((3, num_particles)),
        "direction": jnp.zeros((3, num_particles)),
        "energy": jnp.zeros((1, num_particles)),
    }
    return particles


def main():
    seed = 0
    prng_key = random.PRNGKey(seed)

    num_particles = int(1e6)
    iterations = 10
    runs = 10

    particles = particles_zeros(num_particles)
    for _ in range(runs):
        prng_key, particles = random_walk(prng_key, particles, iterations)

    plt.scatter(particles["position"][0, 0:1000], particles["position"][1, 0:1000])
    plt.show()


if __name__ == "__main__":
    main()
```

The resulting plot looks like: [scatter of the first 1000 particle (x, y) positions; image not reproduced here]

@ftessier and @darcymason, does this approach appropriately address the concern raised earlier?
-
Addressing this next.
-
Accessing the data files is perhaps not the next logical step. In light of the discussion regarding vectorization, it makes more sense to me now to focus on the vectorization logic. Let's consider only photons for clarity, since for electrons a number of complications arise if one uses multiple scattering. Say you have a list of photons in an infinite medium, and 4 interactions (Rayleigh, photoelectric, Compton, pair production). We need to show significant efficiency gains with JIT for the sequence of sampling and applying these interactions; a rough sketch of one vectorised approach follows.
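As an illustration only (my sketch, not code from this thread, with made-up branching probabilities rather than real cross sections), the interaction type can be sampled per photon in one vectorised pass and the four outcomes combined with masks, so the jitted code contains no data-dependent branches:

```python
import jax
import jax.numpy as jnp
from jax import random

def interaction_probabilities(energy):
    # Hypothetical per-photon probabilities for (Rayleigh, photoelectric,
    # Compton, pair); real code would interpolate cross-section tables.
    logits = jnp.stack(
        [-energy, -2.0 * energy, jnp.zeros_like(energy), energy - 2.0]
    )
    return jax.nn.softmax(logits, axis=0)

@jax.jit
def interact(key, energy):
    probs = interaction_probabilities(energy)          # shape (4, n)
    # Vectorised categorical draw: count how many CDF levels each uniform
    # random number exceeds to get an interaction index in 0..3.
    cdf = jnp.cumsum(probs, axis=0)
    xi = random.uniform(key, shape=energy.shape)
    kind = jnp.sum(xi[None, :] > cdf, axis=0)

    # Compute all four outcomes, then select with masks (no branching).
    rayleigh = energy                                  # coherent: unchanged
    photo = jnp.zeros_like(energy)                     # photon absorbed
    compton = 0.5 * energy                             # placeholder energy loss
    pair = jnp.where(energy > 1.022, energy - 1.022, energy)

    outcomes = jnp.stack([rayleigh, photo, compton, pair])
    new_energy = jnp.take_along_axis(outcomes, kind[None, :], axis=0)[0]
    return new_energy, kind

key = random.PRNGKey(0)
energies = jnp.full((1_000_000,), 1.0)  # 1 MeV photons
new_energy, kind = interact(key, energies)
```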
-
Soon we'll need some better profiling tools: even for the toy codes up to now, I find large variances in wall-clock time due to system load. For python scripts there is the cProfile module, which can produce profiling data, and qcachegrind to display the results visually. I wonder if this can profile jitted code?
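For reference, a minimal cProfile workflow might look like the sketch below (`main` stands in for whatever entry point is being profiled; the dump file can then be converted for qcachegrind with a tool such as pyprof2calltree):

```python
import cProfile
import pstats

def main():
    # hypothetical entry point, e.g. the random_walk timing loop above
    sum(i * i for i in range(1_000_000))

# Run the entry point under the profiler and dump the raw data to a file.
cProfile.run("main()", "random_walk.prof")

# Print the 20 most expensive calls, sorted by cumulative time.
stats = pstats.Stats("random_walk.prof")
stats.sort_stats("cumulative").print_stats(20)
```

Note that cProfile traces Python-level calls only, so time spent inside jit-compiled XLA kernels would show up as opaque time at the Python boundary.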
-
@SimonBiggs, do you know how this can work on a local computer, i.e. not on Colab? JAX (XLA) supports only CUDA, correct?
-
Just a heads up, @ftessier and @darcymason: all transpilation prototyping work will be undertaken within the https://github.com/darcymason/egsnrc2py repo, and all JAX vectorisation prototyping within the https://github.com/SimonBiggs/pyegsnrc repo. It is probably easier to discuss each component within corresponding issues in each repo.
-
Julia can also be an option if you don't want to deal with vectorisation.
-
Still very much a work in progress... but here are some results.

First, a side note: the new REPLACE call still didn't work. I added an …

The really good news is that I've compared counts for Compton/photoelectric (both initial and 'indirect' interactions from scattered photons) for 10M particles (1 MeV, to avoid pair/triplet for now) in my 'thin two-slab' geometry (two different materials), and the counts by region are statistically the same for the Python code and the Mortran code. So this is now more than just a toy example, and moving closer to a first draft of photon-only MC.

The speed is very hard to capture in the GPU code. I'm just using the free shared-resources Colab, and the times can vary very widely: e.g. the same kernel code can run in ~60 ms or ~3500 ms. If I do a run with 5M photons and then 10M, and take the difference, it is in the range of 50-80 ms (taking the difference also removes the jit compile time). Using the lowest time as probably the closest to reality, I'm getting about a factor of 15 faster than the single-CPU Mortran code on my laptop. However, I don't know how much faster these could be on a dedicated GPU. The Python times, by the way, include just the "kernel" time, so not the setup, the transfer of data to the device and back, or the summing of interaction counts on the CPU afterwards; but those total ~30 sec or less.

I suspect access to slow memory is holding the GPU back. I've started playing around with changing the memory layout to accumulate counts in the faster GPU shared block memory (not quite working yet). The results above were from one global 'score' array with entries storing interaction counts for each thread individually, meaning a lot of accesses (10M threads * 4 regions * 4 interaction types) to the slowest memory. I intend to keep playing with the shared-memory scoring and, once that is worked out, try pair/triplet and confirm it also agrees with the Mortran code.
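For what it's worth, a sketch of the shared-memory scoring idea (my illustration, assuming a Numba CUDA kernel; the thread doesn't show the actual kernel code, and all names here are hypothetical): each block accumulates its counts in a small shared-memory scoreboard and flushes it to the global score array once per block, instead of every thread writing to slow global memory.

```python
import numpy as np
from numba import cuda, int32

NUM_REGIONS = 4
NUM_INTERACTIONS = 4
THREADS_PER_BLOCK = 256

@cuda.jit
def score_kernel(regions, interactions, global_score):
    # Per-block scoreboard in fast shared memory (4 x 4 counters).
    block_score = cuda.shared.array((NUM_REGIONS, NUM_INTERACTIONS), int32)

    t = cuda.threadIdx.x
    # Cooperatively zero the 16 shared counters.
    if t < NUM_REGIONS * NUM_INTERACTIONS:
        block_score[t // NUM_INTERACTIONS, t % NUM_INTERACTIONS] = 0
    cuda.syncthreads()

    # Each thread scores its particle with a (fast) shared-memory atomic.
    i = cuda.grid(1)
    if i < regions.size:
        cuda.atomic.add(block_score, (regions[i], interactions[i]), 1)
    cuda.syncthreads()

    # One flush per block into the global score array.
    if t < NUM_REGIONS * NUM_INTERACTIONS:
        r, k = t // NUM_INTERACTIONS, t % NUM_INTERACTIONS
        cuda.atomic.add(global_score, (r, k), block_score[r, k])

n = 10_000_000
rng = np.random.default_rng(0)
regions = cuda.to_device(rng.integers(0, NUM_REGIONS, n).astype(np.int32))
interactions = cuda.to_device(rng.integers(0, NUM_INTERACTIONS, n).astype(np.int32))
score = cuda.to_device(np.zeros((NUM_REGIONS, NUM_INTERACTIONS), dtype=np.int32))

blocks = (n + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
score_kernel[blocks, THREADS_PER_BLOCK](regions, interactions, score)
print(score.copy_to_host())
```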
-
A discussion in the pymedphys repository recently touched upon the topic of rewriting EGSnrc in a modern language. @SimonBiggs mentioned the possibility of using JAX/Numpy to run simulations, with "native" support for GPU compilation. This is at least worth a good look, building toy models to study performance. @SimonBiggs has shown that the out-of-the-box GPU compilation is a couple of orders of magnitude faster than straight CPU runs (for simply updating "particle" arrays). The CPU runs with Numpy are nearly as fast as C++.
For the record, here are the python code and timing results (relying on simple wall clock time, careful!) to update (r, u, E) for 1e6 particles:
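(The original python snippet and its timing output are not reproduced in this export; judging from the later comments in this thread, it presumably looked something along these lines — a reconstruction, not the original code:)

```python
import time

from jax import jit, random
import jax.numpy as jnp

@jit
def update(key, r, u, E):
    # Advance every particle at once: position r, direction u and energy E
    # each receive an independent standard-normal increment.
    key, subkey = random.split(key)
    deltas = random.normal(subkey, shape=(7, r.shape[1]))
    return key, r + deltas[0:3], u + deltas[3:6], E + deltas[6]

n = int(1e6)
r, u, E = jnp.zeros((3, n)), jnp.zeros((3, n)), jnp.zeros(n)
key = random.PRNGKey(0)

start = time.time()
key, r, u, E = update(key, r, u, E)
E.block_until_ready()  # wait for JAX's async dispatch before stopping the clock
print("update duration: {:.3f} ms".format((time.time() - start) * 1000.0))
```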
And here is the equivalent C++ code, with the same random number generator (and equally poor timer!):
Please comment further regarding the outlook for implementing EGSnrc in python. We can't expect efficiency to match optimized C++, but if we get GPU (and TPU!) compilation for "free" and gain in code clarity with python, my opinion is that it would be worth a shot.