Hi all,
I have a problem when running 'run_alphafold.sh'; it always fails with an error like this:
2023-01-01 07:07:23.507834: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 0 failed: INTERNAL: Failed to allocate 50331648 bytes for new constant
Traceback (most recent call last):
File "train.py", line 264, in
app.run(main)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "train.py", line 216, in main
state = jax.pmap(updater.init)(rng_pmap, data)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/api.py", line 2158, in cache_miss
out_tree, out_flat = f_pmapped(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/api.py", line 2034, in pmap_f
out = pxla.xla_pmap(
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2022, in bind
return map_bind(self, fun, *args, **params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2054, in map_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2025, in process
return trace.process_map(self, fun, tracers, params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 687, in process_call
return primitive.impl(f, *tracers, **params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 841, in xla_pmap_impl
return compiled_fun(*args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/profiler.py", line 294, in wrapper
return func(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 1656, in call
out_bufs = self.xla_executable.execute_sharded_on_local_devices(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to allocate 50331648 bytes for new constant: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train.py", line 264, in
app.run(main)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "train.py", line 216, in main
state = jax.pmap(updater.init)(rng_pmap, data)
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to allocate 50331648 bytes for new constant: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
I have 8 nodes, each with a 12 GB GPU, and 125 GB of RAM. Can anyone tell me how to solve this?
Sorry for the late response. Can you please share the protein sequences you are providing as input to AlphaFold?
What you report sounds like an out-of-memory problem. Keep in mind that even though you have multiple GPUs, only one will be used for execution, as AlphaFold is not parallelized across devices.
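One thing worth trying for the "Failed to allocate ... bytes for new constant" error is relaxing JAX/XLA's GPU memory behavior via environment variables, as suggested in AlphaFold's own run instructions (unified memory lets JAX spill to host RAM when the 12 GB GPU fills up). A minimal sketch; the specific values are assumptions you may need to tune, and the variables must be set before jax is imported:

```python
import os

# Must be set BEFORE importing jax/jaxlib, or they have no effect.
os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"           # allow GPU allocations to spill into host RAM
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "4.0"  # permit up to 4x the GPU's memory via unified memory
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false" # allocate on demand instead of grabbing ~90% up front

# Only now import jax and run the model.
# import jax
```

Alternatively, export the same variables in the shell before launching run_alphafold.sh. If the sequence is very long, no amount of tuning may fit it on a single 12 GB card.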