
Optimize GCLDAModel with numba and Parallel #744

Draft
JulioAPeraza wants to merge 19 commits into main from gclda-numba

Conversation

JulioAPeraza (Collaborator)

Closes None.

Currently, the GCLDA model is slow, mainly when the vocabulary is large (for example, the 3228 terms from Neurosynth). _update_word_topic_assignments takes ~73 seconds per iteration and _update_peak_assignments takes ~52 seconds per iteration, for a total of ~125 seconds per _update() iteration. Compiling and running these functions with numba reduces these times to ~2 and ~4 seconds, for a total of ~6 seconds per iteration.

Changes proposed in this pull request:

  • Convert _update_word_topic_assignments and _update_peak_assignments to static methods, and add a decorator to compile and run them with numba in nopython mode. A minimal sketch of the pattern follows.
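
A sketch of the pattern, for illustration only (the decorator matches the one added in this PR; the signature and body are simplified placeholders, not the real function):

from numba import jit
import numpy as np

@jit(nopython=True, cache=True)
def _update_word_topic_assignments(word_topic_counts, beta, randseed):
    # Simplified body: the real function resamples the topic assignment
    # of every word token with collapsed Gibbs sampling.
    np.random.seed(randseed)
    n_words, n_topics = word_topic_counts.shape
    for i_word in range(n_words):
        probs = word_topic_counts[i_word, :] + beta
        probs = probs / np.sum(probs)
        # ... sample a new topic from probs and update the count matrix ...
    return word_topic_counts

_update_word_topic_assignments(np.ones((5, 3)), 0.01, 42)

Because nopython mode cannot compile methods that receive self, the functions must become static methods or module-level functions, which motivates the discussion below.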

@JulioAPeraza JulioAPeraza added the refactoring Requesting changes to the code which do not impact behavior label Aug 24, 2022
codecov bot commented Aug 24, 2022

Codecov Report

Base: 88.56% // Head: 87.68% // Decreases project coverage by -0.87% ⚠️

Coverage data is based on head (fc7c5e6) compared to base (1957869).
Patch coverage: 77.21% of modified lines in pull request are covered.

❗ Current head fc7c5e6 differs from pull request most recent head 0d2939c. Consider uploading reports for the commit 0d2939c to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #744      +/-   ##
==========================================
- Coverage   88.56%   87.68%   -0.88%     
==========================================
  Files          38       35       -3     
  Lines        4371     3913     -458     
==========================================
- Hits         3871     3431     -440     
+ Misses        500      482      -18     
Impacted Files Coverage Δ
nimare/annotate/gclda.py 92.04% <77.21%> (-5.27%) ⬇️
nimare/meta/utils.py 95.73% <0.00%> (-0.03%) ⬇️
nimare/dataset.py 89.96% <0.00%> (ø)
nimare/__init__.py
nimare/base.py
nimare/utils.py



tsalo commented Aug 24, 2022

With the changes to the two update methods, it seems like they could be converted to functions instead. Is there any benefit to using @staticmethod over making them functions?

JulioAPeraza (Collaborator Author)

You're right. The two update methods can be converted to functions. I don't see any benefit in using staticmethod in this case.

Comment on lines 227 to 228
# Float precision issue in Numba: https://github.com/numba/numba/issues/3426
probs_pdf = np.trunc(probs_pdf * (10**12)) / (10**12)
JulioAPeraza (Collaborator Author)

This line will need to be removed in the future. It ensures the sum never goes over 1.0, so that we don't get this error in numba, but the truncation is causing the maps in the example to look different from the ones produced by previous versions.
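
A self-contained illustration of the precision problem being worked around (values are arbitrary):

import numpy as np

probs_pdf = np.random.rand(400)
# After normalization, float64 rounding can leave the sum marginally
# above 1.0 (e.g., 1.0000000000000002), which numba's multinomial rejects.
probs_pdf = probs_pdf / np.sum(probs_pdf)
print(np.sum(probs_pdf))

# Truncating at 12 decimal places only ever shrinks each entry, so in
# practice it brings the sum back to at most 1.0:
probs_pdf = np.trunc(probs_pdf * (10**12)) / (10**12)
print(np.sum(probs_pdf))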

JulioAPeraza (Collaborator Author)

I was running some tests with Neurosynth and NeuroQuery, and got the error again:

File "<...>/lib/python3.9/site-packages/nimare/annotate/gclda.py", line 622, in fit
    self._update(loglikely_freq=loglikely_freq)
  File "<...>/lib/python3.9/site-packages/nimare/annotate/gclda.py", line 681, in _update
    ) = _update_peak_assignments(
  File "<...>/lib/python3.9/site-packages/numba/cpython/randomimpl.py", line 833, in binomial_impl
    raise ValueError("binomial(): p outside of [0, 1]")
ValueError: binomial(): p outside of [0, 1]

Member

would np.clip help here?
during normalization, you could do this:
probs_pdf = (probs_pdf / np.sum(probs_pdf)).clip(0, 1)

JulioAPeraza (Collaborator Author)

I tried using clip, but it still raised the same error. I think it has to do with the way numba implements the sum.

JulioAPeraza (Collaborator Author)

Given these limitations (precision issue with sum, and codecov of jitted functions), do you think we should try Cython for these two functions?


tsalo commented Aug 25, 2022

I think @jdkent or @adelavega would be better able to answer that.

@@ -14,6 +15,247 @@
LGR = logging.getLogger(__name__)


@jit(nopython=True, cache=True)
Member

would codecov track this function if nopython=False?

JulioAPeraza (Collaborator Author)

I'm not sure. Let me try this out.

JulioAPeraza (Collaborator Author)

@adelavega _update_peak_assignments takes a total of 132.477 s per iteration in pure Python (3-4 s with numba). The line (236) that is having issues in numba is also the one that takes ~20% of the running time.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
.
.
.
   170                                               # Seed random number generator
   171         2        168.0     84.0      0.0      np.random.seed(randseed)
   172                                           
   173                                               # Iterate over all peaks x, and sample a new y and r assignment for each
   174    768919     904087.0      1.2      0.7      for i_ptoken, doc in enumerate(doc_idx):
   175    768918     646397.0      0.8      0.5          topic = peak_topic_idx[i_ptoken]
   176    768918     622581.0      0.8      0.5          region = peak_region_idx[i_ptoken]
   177                                           
   178                                                   # Decrement count-matrices to remove current ptoken_topic_idx
   179                                                   # because ptoken will be reassigned to a new topic
   180    768918    1234560.0      1.6      0.9          ny_region_topic[region, topic] -= 1  # Decrement count in Subregion x Topic count matx
   181    768918    1064501.0      1.4      0.8          ny_doc_topic[doc, topic] -= 1  # Decrement count in Document x Topic count matrix
   182                                           
   183                                                   # Retrieve the probability of generating current x from all
   184                                                   # subregions: [R x T] array of probs
   185    768918    1719597.0      2.2      1.3          p_x_subregions = (peak_probs[i_ptoken, :, :]).transpose()
   186                                           
   187                                                   # Compute the probabilities of all subregions given doc
   188                                                   #     p(r|d) ~ p(r|t) * p(t|d)
   189                                                   # Counts of subregions per topic + prior: p(r|t)
   190    768918    5259387.0      6.8      4.0          p_region_g_topic = ny_region_topic + delta
   191                                           
   192                                                   # Normalize the columns such that each topic's distribution over subregions sums to 1
   193    768918   12123463.0     15.8      9.2          p_region_g_topic_sum = np.sum(p_region_g_topic, axis=0)
   194    768918    3210778.0      4.2      2.4          p_region_g_topic = np.divide(p_region_g_topic, p_region_g_topic_sum)
   195                                           
   196                                                   # Counts of topics per document + prior: p(t|d)
   197    768918    3648791.0      4.7      2.8          p_topic_g_doc = ny_doc_topic[doc, :] + alpha
   198                                           
   199                                                   # Reshape from (ntopics,) to (nregions, ntopics) with duplicated rows
   200    768918    2823998.0      3.7      2.1          p_topic_g_doc = p_topic_g_doc.repeat(n_regions).reshape((-1, n_regions)).T
   201                                           
   202                                                   # Compute p(subregion | document): p(r|d) ~ p(r|t) * p(t|d)
   203                                                   # [R x T] array of probs
   204    768918    2890817.0      3.8      2.2          p_region_g_doc = p_topic_g_doc * p_region_g_topic
   205                                           
   206                                                   # Compute the multinomial probability: p(z|y)
   207                                                   # Need the current vector of all z and y assignments for current doc
   208                                                   # The multinomial from which z is sampled is proportional to number
   209                                                   # of y assigned to each topic, plus constant gamma
   210                                                   # Compute the proportional probabilities in log-space
   211   1537836    3591206.0      2.3      2.7          logp = nz_doc_topic[doc, :] * np.log(
   212    768918    6299922.0      8.2      4.8              (ny_doc_topic[doc, :] + gamma + 1) / (ny_doc_topic[doc, :] + gamma)
   213                                                   )
   214                                                   # Add a constant before exponentiating to avoid any underflow issues
   215    768918   14197485.0     18.5     10.7          p_peak_g_topic = np.exp(logp - np.max(logp))
   216                                           
   217                                                   # Reshape from (ntopics,) to (nregions, ntopics) with duplicated rows
   218    768917    2708038.0      3.5      2.0          p_peak_g_topic = p_peak_g_topic.repeat(n_regions).reshape((-1, n_regions)).T
   219                                           
   220                                                   # Get the full sampling distribution:
   221                                                   # [R x T] array containing the proportional probability of all y/r combinations
   222    768917    4631186.0      6.0      3.5          probs_pdf = p_x_subregions * p_region_g_doc * p_peak_g_topic
   223                                           
   224                                                   # Convert from a [R x T] matrix into a [R*T x 1] array we can sample from
   225    768917    5906090.0      7.7      4.5          probs_pdf = np.reshape(probs_pdf, (n_regions * n_topics))
   226                                           
   227                                                   # Normalize the sampling distribution
   228    768917   12366888.0     16.1      9.3          probs_pdf = probs_pdf / np.sum(probs_pdf)
   229                                                   # Float precision issue in Numba: https://github.com/numba/numba/issues/3426
   230    768917    4950658.0      6.4      3.7          probs_pdf = np.trunc(probs_pdf * (10**12)) / (10**12)
   231                                           
   232                                                   # Sample a single element (corresponding to a y_i and c_i assignment for the ptoken)
   233                                                   # from the sampling distribution
   234                                                   # Returns a binary [1 x R*T] vector with a '1' in location that was sampled
   235                                                   # and zeros everywhere else
   236    768917   25842795.0     33.6     19.5          assignment_vec = np.random.multinomial(1, probs_pdf)
   237                                           
   238                                                   # Reshape 1D back to [R x T] 2D
   239    768917    5398268.0      7.0      4.1          assignment_arr = np.reshape(assignment_vec, (n_regions, n_topics))
   240                                                   # Transform the linear index of the sampled element into the
   241                                                   # subregion/topic (r/y) assignment indices
   242    768917    4236942.0      5.5      3.2          assignment_idx = np.where(assignment_arr)
   243                                                   # Subregion sampled (r)
   244    768917    1139169.0      1.5      0.9          region = assignment_idx[0][0]
   245                                                   # Topic sampled (y)
   246    768917     737100.0      1.0      0.6          topic = assignment_idx[1][0]
   247                                           
   248                                                   # Update the indices and the count matrices using the sampled y/r assignments
   249                                                   # Increment count in Subregion x Topic count matrix
   250    768917    1799035.0      2.3      1.4          ny_region_topic[region, topic] += 1
   251                                                   # Increment count in Document x Topic count matrix
   252    768917    1148576.0      1.5      0.9          ny_doc_topic[doc, topic] += 1
   253                                                   # Update y->topic assignment
   254    768917     669923.0      0.9      0.5          peak_topic_idx[i_ptoken] = topic
   255                                                   # Update y->subregion assignment
   256    768917     704382.0      0.9      0.5          peak_region_idx[i_ptoken] = region
   257                                           
   258         1          1.0      1.0      0.0      return ny_region_topic, ny_doc_topic, peak_topic_idx, peak_region_idx

# from the sampling distribution
# Returns a binary [1 x R*T] vector with a '1' in location that was sampled
# and zeros everywhere else
assignment_vec = np.random.multinomial(1, probs_pdf)
Member

The documentation says to "use the multinomial method of a default_rng() instance instead". If we're lucky, that might also be faster?

JulioAPeraza (Collaborator Author)

I have tried different combinations with:

rng = np.random.default_rng()
assignment_vec = rng.multinomial(1, probs_pdf)

and everything failed in numba.

Using the correction proposed in this issue https://github.com/numba/numba/issues/3426, the current assignment_vec = np.random.multinomial(1, probs_pdf) is working fine without truncating probs_pdf.

adelavega (Member) commented Aug 26, 2022

Ah, that's good to hear. I'd love to hear what the speed-up is using numba on a large-ish dataset. When you get a chance to run some benchmarks, ping me.

We then need to decide if it's worth the "risk" of forking numba. Unfortunately, it is a potential maintenance problem.

JulioAPeraza (Collaborator Author)

@adelavega Running times per iteration for 200 topics:

Function                        Branch       Neurosynth (14369 studies, 3228 words)  NeuroQuery (13459 studies, 6145 words)
_update_word_topic_assignments  master       ~96s                                    ~1342s
_update_word_topic_assignments  gclda-numba  ~10s                                    ~149s
_update_peak_assignments        master       ~112s                                   ~96s
_update_peak_assignments        gclda-numba  ~55s                                    ~50s

Typically we have 10,000 iterations, so to train GCLDA on Neurosynth with master would take ~25 days, which can be reduced to ~7.5 days with gclda-numba.

JulioAPeraza (Collaborator Author)

The latest changes showed improvement in _update_peak_assignments.

Creating chunks (n_chunks = n_cores) of the data and running each chunk on a separate processor reduced the overhead and the number of delayed() calls made by the embarrassingly parallel routine. Right now we have two nested for loops: the outer one iterates over chunks and is embarrassingly parallelized, while the inner one iterates within a chunk, extending the duration of each task submitted to Parallel. Besides reducing the overhead, these changes reduce the time it takes to dispatch so many (in our case 500,000) very fast tasks to workers; see the sketch below.
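
A self-contained sketch of the chunking strategy (names and workload are illustrative, not the actual NiMARE code):

import numpy as np
from joblib import Parallel, delayed

def process_chunk(values):
    # The inner loop runs inside the worker, so each dispatched task is
    # long-running instead of one tiny task per element.
    total = 0.0
    for v in values:
        total += v ** 2
    return total

values = np.random.rand(500_000)
n_cores = 4
chunk_size = int(np.ceil(len(values) / n_cores))  # n_chunks == n_cores

# Only n_cores delayed() calls are dispatched, rather than 500,000.
results = Parallel(n_jobs=n_cores)(
    delayed(process_chunk)(values[i : i + chunk_size])
    for i in range(0, len(values), chunk_size)
)
print(sum(results))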

However, this is only scalable to a small number of CPUs: I saw roughly linear scaling only up to 4 CPUs on a local machine and up to 8 CPUs on an HPC.

What do you all think? Should we:

  1. keep numba support for _update_word_topic_assignments and wait for numba to fix the issue with the multinomial generator to add support for _update_peak_assignments, or
  2. keep both numba support for _update_word_topic_assignments and Parallel for _update_peak_assignments, with a warning to users that the algorithm is only scalable to a small number of CPUs (4-8)?

I think number 1 is good, since using numba for _update_word_topic_assignments already makes GCLDA 10x faster.


adelavega commented Oct 6, 2022

@JulioAPeraza awesome! q: why is it only scalable to a small number of CPUs? I don't quite understand that.

But given that limitation, I agree with your reasoning and think #1 is a good option. That's an order of magnitude, so not bad.

JulioAPeraza (Collaborator Author)

@adelavega
I'm not sure. I guess the scalability problem comes from the communication cost added by having all the input arrays written to disk (or copied to memory) in parallel. Although increasing the number of CPUs reduces the calculation time, it may also increase the communication cost, making the algorithm not scalable.
I just realized that, by default, Parallel will dump to a file any array larger than 1 megabyte. I ran a couple of tests with max_nbytes=None, and now the algorithm is at least weakly scalable to all the cores available in my laptop (8 CPUs). I still need to test this on the HPC with a larger number of CPUs. With 8 CPUs, the calculation time per iteration of _update goes down from ~112 s to ~15 s.
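
For reference, joblib's default is max_nbytes="1M", so disabling the memmapping is a one-argument change (a generic sketch, not the NiMARE call site):

import numpy as np
from joblib import Parallel, delayed

big = np.random.rand(2_000_000)  # ~16 MB, well above the 1 MB default

# max_nbytes=None disables automatic dumping of large input arrays to
# disk, trading memory copies for the disk I/O described above.
results = Parallel(n_jobs=4, max_nbytes=None)(
    delayed(np.sum)(big[i::4]) for i in range(4)
)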

adelavega (Member)

Ah, disk I/O would certainly explain it. I wonder whether using "threads" would also help: https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism
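
Switching backends is also a one-argument change (generic sketch):

import numpy as np
from joblib import Parallel, delayed

arr = np.random.rand(1_000_000)

# prefer="threads" avoids inter-process pickling and memmapping entirely,
# but the GIL limits speedups unless the workload releases it.
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(np.sum)(arr[i::4]) for i in range(4)
)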

@JulioAPeraza JulioAPeraza changed the title Optimize GCLDAModel with numba Optimize GCLDAModel with numba and Parallel Oct 10, 2022
JulioAPeraza (Collaborator Author)

I tested the scalability of the function with more CPUs on the HPC, and it looks good. I also ran some tests using threads, but it was slower than the default loky backend.

This PR is ready for review. Thanks!

adelavega (Member)

Do you know if explicitly setting parallel=True would help the numba optimized _update_peak_assignments?

https://numba.readthedocs.io/en/stable/user/parallel.html

It looks like the main concern would be that the same matrices are being accessed by the document loop, but it looks like it would be different parts of the arrays?
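
For reference, this is the shape parallel=True takes in numba (a generic sketch; whether it applies here is exactly the open question). prange is only safe when iterations do not race on shared state:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_sums(mat):
    out = np.empty(mat.shape[0])
    for i in prange(mat.shape[0]):
        # Safe: each iteration writes to a distinct element of out.
        out[i] = mat[i, :].sum()
    return out

print(row_sums(np.random.rand(8, 4)))

In _update_peak_assignments, though, every iteration decrements and increments the same count matrices, and each iteration's sampling distribution depends on the counts left by earlier iterations, so a naive prange over peak tokens would risk races and could change the sampler's semantics.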

adelavega (Member)

Otherwise, I took a closer look at the GCLDA algorithm, and this all looks reasonable to me. I may have some more nit-picky comments later, but perhaps @jdkent and @tsalo can help with those as well.

JulioAPeraza (Collaborator Author)

Thanks, @adelavega.

Do you know if explicitly setting parallel=True would help the numba optimized _update_peak_assignments?

I tried parallel=True before in _update_word_topic_assignments and also in _update_peak_assignments running on our fork of numba, and I got a Segmentation Fault error. I will look into it early tomorrow.

jdkent (Member) left a comment

Looks good, just a minor readability suggestion from me.

Comment on lines 673 to 679
[
self.data["ptoken_doc_idx"][chunk_i : chunk_i + chunk_size],
old_regions[chunk_i : chunk_i + chunk_size],
old_topics[chunk_i : chunk_i + chunk_size],
peak_probs[chunk_i : chunk_i + chunk_size, :, :],
seeds[chunk_i : chunk_i + chunk_size],
]
jdkent (Member)

can this be a dictionary so that the arguments being sent to _update_peak_assignments are explicit?


results = Parallel(n_jobs=self.n_cores, max_nbytes=None)(
delayed(_update_peak_assignments)(
*chunk,
jdkent (Member)

With the above, this would now be **chunk.
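
A minimal, generic illustration of the suggested dict-plus-** pattern (names are placeholders, not the real arguments):

from joblib import Parallel, delayed

def update(doc_idx, regions, topics):
    return len(doc_idx) + len(regions) + len(topics)

# Each chunk is a dict, so the arguments are named at the call site...
chunks = [
    dict(doc_idx=[1, 2], regions=[0, 1], topics=[3, 4]),
    dict(doc_idx=[3], regions=[1], topics=[5]),
]

# ...and unpacked with ** instead of *.
results = Parallel(n_jobs=2)(delayed(update)(**chunk) for chunk in chunks)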

Comment on lines 686 to 693
self.topics["n_peak_tokens_region_by_topic"],
self.topics["n_peak_tokens_doc_by_topic"],
self.topics["n_word_tokens_doc_by_topic"],
self.params["n_regions"],
self.params["n_topics"],
self.params["alpha"],
self.params["delta"],
self.params["gamma"],
jdkent (Member)

A bit redundant, but now the arguments here should be named too (e.g., `gamma=self.params["gamma"]`).


jdkent commented Oct 20, 2022

just want to make note of the current gclda example and the one in this pull request:

current: https://nimare.readthedocs.io/en/latest/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py

pull request: https://nimare--744.org.readthedocs.build/en/744/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py

current decoding ROI:
Term Weight
motor 24.666614
seed 20.181775
structural 17.939355
cortex 15.722663
behavioral 15.696936
methods 11.237680
speech 11.218747
connectivity modeling 11.212097
sca17 11.212097
vbm 8.969678

pull request decoding ROI:
Term Weight
insula 11.288984
connectivity 10.711503
approaches 10.034653
network 8.780633
anterior insula 7.525990
nachr 7.525990
functional 6.869626
voice 6.271658
smokers 5.017326
nachr agonists 5.017326

I suppose I do not know how stochastic the results are on this test dataset since it is very small.

EDIT: the results have been consistent since the first change in this pull request: https://nimare--744.org.readthedocs.build/en/744/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py

len(self.vocabulary),
self.params["beta"],
self.params["gamma"],
self.seed,
Member

I noticed that earlier in the code, self.seed = 0 rather than self.seed = self.params["seed_init"].

Could this be part of the issue?

JulioAPeraza (Collaborator Author) commented Oct 21, 2022

That probably explains the difference you saw, because we are testing with a small number of iterations. I think self.params["seed_init"] is used to initialize the peak_topic and peak_subregion assignments, whereas self.seed is used to perform the sampling in _update().

Updated number of word-tokens assigned to each topic (T) per document (D).
"""
# --- Seed random number generator
np.random.seed(randseed)
Member

It's likely jit does something slightly different here, since the results change if I comment out @jit above.
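
One plausible mechanism (an assumption, not verified against this PR): in nopython mode, numba keeps its own random state, separate from NumPy's global one, and reimplements the distribution functions, so the same seed need not reproduce the interpreted stream:

import numpy as np
from numba import njit

@njit
def draw(seed):
    # Inside nopython code, this seeds numba's internal generator,
    # not NumPy's global RandomState.
    np.random.seed(seed)
    return np.random.random()

print(draw(0))             # numba's generator
np.random.seed(0)
print(np.random.random())  # NumPy's generator; the two states are independent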

@JulioAPeraza JulioAPeraza marked this pull request as draft December 8, 2022 15:42