python/examples/alpha_zero.py crashes with CUDA_ERROR_NOT_INITIALIZED #1122

Open
jthemphill opened this issue Oct 11, 2023 · 5 comments
Labels: Windows (This is about support on the Windows platform)

@jthemphill

I'm running Ubuntu 22.04 under WSL2, and I've tried this with both tensorflow==2.14.0 and tf-nightly==2.15.0.dev20231010. I am using Python 3.11.5, which is supported by the latest version of TensorFlow.

You can correctly install TensorFlow with GPU support via pip install --extra-index-url https://pypi.nvidia.com tensorflow[and-cuda], or install the nightly version with pip install --extra-index-url https://pypi.nvidia.com tf-nightly[and-cuda]. Note that without the --extra-index-url flag, the installation will fail, as TensorFlow 2.14.0 depends on specific versions of tensorrt and tensorrt-lib that are not in the public PyPI repository.
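
For copy-paste convenience, those two install commands on their own lines (verbatim from above; the quotes are only a precaution against shells that glob the brackets):

pip install --extra-index-url https://pypi.nvidia.com "tensorflow[and-cuda]"
pip install --extra-index-url https://pypi.nvidia.com "tf-nightly[and-cuda]"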

I verified that my graphics card is visible to the WSL2 container:

~/open_spiel$ nvidia-smi
Tue Oct 10 22:58:49 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.120                Driver Version: 537.58       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        On  | 00000000:01:00.0  On |                  N/A |
| 35%   52C    P0              35W / 180W |    962MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

And I verified that TensorFlow itself runs code correctly on my GPU, by running the script below, seeing results, and noting the spike in my GPU's utilization while it runs:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

But even though TensorFlow is working with my graphics card, alpha_zero.py fails:

~/open_spiel$ python open_spiel/python/examples/alpha_zero.py 
2023-10-10 22:51:42.689219: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-10 22:51:42.689281: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-10 22:51:42.690266: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-10 22:51:42.695684: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-10 22:51:43.360880: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-10 22:51:44.101936: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.127175: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.127253: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Starting game connect_four
Writing logs and checkpoints to: /tmp/az-2023-10-10-22-51-connect_four-87c21nuk
Model type: resnet(128, 10)
actor-0 started
actor-1 started
learner started
[2023-10-10 22:51:44.141] Initializing model
evaluator-0 started
Exception caught in evaluator-0: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
evaluator-0 exiting
Process Process-3:
Exception caught in actor-0: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
actor-0 exiting
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 287, in evaluator
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 268, in actor
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-10 22:51:44.231365: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.231499: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.231562: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Exception caught in actor-1: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
actor-1 exiting
Process Process-2:
Traceback (most recent call last):
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 268, in actor
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
^C2023-10-10 22:51:45.582786: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.582883: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.582914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2017] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-10-10 22:51:45.582959: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.583002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6562 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
[2023-10-10 22:51:45.587] learner exiting
learner exiting

<hangs at 0% GPU usage>
Caught a KeyboardInterrupt, stopping early.

AlphaZero forks actor, evaluator, and learner processes, and it's these subprocesses that fail, so I believe this is related to tensorflow/tensorflow#57877.
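
For context, a CUDA context initialized in a parent process is not usable in a child created with fork(). A minimal sketch of that failure pattern (illustrative only, not code from this repo; names and shapes are arbitrary):

import multiprocessing as mp

import tensorflow as tf


def child():
    # The forked child inherits the parent's CUDA state, which is invalid
    # after fork(); GPU ops here can fail with CUDA_ERROR_NOT_INITIALIZED,
    # as in the logs above.
    print(tf.matmul(tf.ones((2, 2)), tf.ones((2, 2))))


if __name__ == "__main__":
    # Touch the GPU once so the parent initializes a CUDA context.
    print(tf.matmul(tf.ones((2, 2)), tf.ones((2, 2))))

    # multiprocessing defaults to fork on Linux, which is presumably what
    # the AlphaZero worker processes hit; "fork" makes that explicit.
    p = mp.get_context("fork").Process(target=child)
    p.start()
    p.join()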

@lanctot lanctot added the Windows This is about support on the Windows platform label Oct 12, 2023
@lanctot
Collaborator

lanctot commented Oct 12, 2023

Hi,

This is a pretty complex setup... I'm not sure how we can help, as we don't have a setup like this to reproduce it.

Have you tried the simple program on that thread you linked, i.e. tensorflow/tensorflow#57877 (comment)?

Did you see that CUDA support on Windows was being removed from TF? tensorflow/tensorflow#59905. According to that thread, it should still work in WSL. It seems like you need WSL2... but you are indeed using that. So, yeah... it seems like it should work.

@tewalds: any ideas?

@jthemphill
Author

I don't think this is a complex setup at all! I installed a recent version of tensorflow[and-cuda], with a GPU that supports CUDA.

I made sure to use the correct dependency versions, even going so far as to track down the missing old version of tensorrt-lib, which should be in PyPI but isn't!

I then ran alpha_zero.py. Does alpha_zero.py work for you when you use it with a GPU?

I did run the failing code example in the linked issue, and it did fail in the same way. It seems to me that alpha_zero.py forks processes in a way that CUDA does not support!

@lanctot
Collaborator

lanctot commented Oct 13, 2023

I don't think this is a complex setup at all!

Well, first: OpenSpiel is not officially supported on Windows. We don't have Windows machines easily at our disposal, so we don't test things on Windows hosts and have only run things ourselves within WSL a few times. I have no clue how CUDA drivers are supported through WSL.

Second, you're using the most recent nightly versions of TF, which we don't test regularly on our CI (we only test 2.12.0; see here). Due to this, the new TF requires a specific/custom older version of tensorrt and tensorrt-lib. Maybe these don't come with CUDA support, or are not getting built properly? 🤷

Third, TF recently stopped supporting CUDA on native Windows. That should not affect you, since you're running within WSL, but I wonder if, in the process of disabling CUDA on native Windows, something else in the code chain is causing the CUDA issues in your setup. (I realize this is unlikely.)

Then there's the thread you linked, which might be related because of the forked actor processes...?

That feels like a pretty complex setup to me. We'll do our best to help, but without being able to mimic your setup, it will be difficult.

Does alpha_zero.py work for you when you use it with a GPU?

I believe @tewalds might be the only one who has run our Python AlphaZero using CUDA; IIRC it was almost certainly on a native Linux machine, and I believe it was about 3 years ago. 😅

I don't know of any instances of people running the Python TF AlphaZero using CUDA within WSL. I barely know of one person who has used it with CUDA at all, and that was long ago. The more common choice is the C++ LibTorch version on native Linux machines, because it's faster.

I'd like to know if it currently runs on a Linux machine with CUDA. @tewalds, is it easy for you to try on your desktop? Can you tell me if you run into the same issue?

@tacertain
Contributor

I have been updating the Python AlphaZero to Keras 3, and I'm running into the same thing. I don't think it's a Windows problem; there's some challenge with Keras 3 and forking. There are a few forum posts about it, but nothing definitive, e.g. https://stackoverflow.com/questions/33748750/cuda-error-initialization-error-when-using-parallel-in-python. I did try changing the start method to "spawn" in spawn.py, but that didn't fix it.
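
For readers unfamiliar with that knob, the standard-library form of the change looks like this (a sketch of the idea only; the actual edit was in open_spiel's spawn.py, and as noted it did not help here):

import multiprocessing as mp

# Start children in a fresh interpreter ("spawn") instead of forking, so
# they don't inherit the parent's CUDA state.
mp.set_start_method("spawn", force=True)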

I might look into whether it's possible to lazy-load the core Keras libraries. It didn't seem super easy, but I don't have a lot of other ideas. It's not obvious what's going on, because, for example, this sample code runs correctly:

# Most of these imports are unused here on purpose: they mirror
# alpha_zero.py's imports, showing that merely importing them is not
# what breaks the forked child.
import collections
import datetime
import functools
import itertools
import json
import os
import random
import sys
import tempfile
import time
import traceback

import numpy as np

import keras.callbacks as kcb

from open_spiel.python.algorithms import mcts
from open_spiel.python.algorithms.alpha_zero import evaluator as evaluator_lib
from open_spiel.python.algorithms.alpha_zero import model as model_lib
import pyspiel
from open_spiel.python.utils import data_logger
from open_spiel.python.utils import file_logger
from open_spiel.python.utils import spawn
from open_spiel.python.utils import stats

import tensorflow as tf


def child(queue):
    # spawn.Process supplies the queue argument; it's unused here.
    print("Child GPUs: " + str(tf.config.list_physical_devices('GPU')))

    for i in range(20):
        a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
        b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        c = tf.matmul(a, b)
        print("Child: " + str(c))


child_proc = spawn.Process(child)
print("Parent GPUs: " + str(tf.config.list_physical_devices('GPU')))

for i in range(20):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)
    print("Parent: " + str(c))

@tacertain
Copy link
Contributor

I just reproduced this on a bare-metal Ubuntu 20.04 machine with TF 2.16.1 and Keras 3.3.3.
