Failed to allocate 50331648 bytes for new constant #55

Open
Violet969 opened this issue Jan 1, 2023 · 1 comment

@Violet969

Violet969 commented Jan 1, 2023

Hi all,
I have a problem when running 'run_alphafold.sh'; I always get the following error.

2023-01-01 07:07:23.507834: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 0 failed: INTERNAL: Failed to allocate 50331648 bytes for new constant
Traceback (most recent call last):
File "train.py", line 264, in
app.run(main)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "train.py", line 216, in main
state = jax.pmap(updater.init)(rng_pmap, data)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/src/api.py", line 2158, in cache_miss
out_tree, out_flat = f_pmapped(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/api.py", line 2034, in pmap_f
out = pxla.xla_pmap(
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2022, in bind
return map_bind(self, fun, *args, **params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2054, in map_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2025, in process
return trace.process_map(self, fun, tracers, params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 687, in process_call
return primitive.impl(f, *tracers, **params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 841, in xla_pmap_impl
return compiled_fun(*args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/profiler.py", line 294, in wrapper
return func(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 1656, in call
out_bufs = self.xla_executable.execute_sharded_on_local_devices(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to allocate 50331648 bytes for new constant: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train.py", line 264, in
app.run(main)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "train.py", line 216, in main
state = jax.pmap(updater.init)(rng_pmap, data)
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to allocate 50331648 bytes for new constant: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

I have 8 nodes with 12 GB GPUs and 125 GB of memory. Can anyone tell me how to solve this?

@Old-Shatterhand
Member

Hi @Violet969,

sorry for the late response. Could you please share the protein sequences you provide as input to AlphaFold?

What you report sounds like an out-of-memory problem. Please remember that even though you have multiple GPUs, only one will be used for execution, as AlphaFold is not parallelized across devices.
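If it really is GPU memory running out, one thing worth trying (a sketch only; the exact values are guesses and depend on your setup) is relaxing JAX's GPU memory preallocation, or letting it spill into host RAM, before jax is imported, e.g. at the very top of train.py:

```python
# Sketch only: these are standard JAX/XLA environment variables, but the values
# below are guesses and must be set before jax is imported (either here or in
# the shell that launches the script).
import os

# Option 1: stop XLA from preallocating most of the GPU memory up front and
# allocate on demand instead.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

# Option 2 (the combination AlphaFold's Docker setup uses): enable unified
# memory so the GPU can spill into host RAM, and allow more than one GPU's
# worth of memory to be requested.
# os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "4.0"

import jax  # must be imported only after the variables above are set
print(jax.devices())
```

Whether this helps depends on whether the ~48 MB constant is really the whole problem or just the last allocation that pushed a 12 GB card over its limit.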

Best, Roman
