
Issue with XLA devices #119

Open
snoreis opened this issue Mar 4, 2022 · 1 comment

Comments


snoreis commented Mar 4, 2022

Hi All,

I'm having some trouble running N2V. I have a computer with an NVIDIA RTX A5000 and Ubuntu 18.04.

I used conda to install N2V as follows:

$ conda create -n 'n2v' python=3.7
$ source activate n2v
$ conda install tensorflow-gpu=2.4.1 keras=2.3.1
$ pip install jupyter
$ pip install n2v

and then ran the Jupyter notebook given here.
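One thing worth double-checking with a pinned TF version is the CUDA/cuDNN pair it was built against. A minimal lookup sketch — the table is abbreviated from TensorFlow's "tested build configurations" page, so treat the entries as assumptions to verify against the official docs:

```python
# Rough sketch of the TensorFlow tested-build matrix (abbreviated; verify
# against the official "Tested build configurations" table): which CUDA
# toolkit and cuDNN each tensorflow-gpu release pairs with.
TF_BUILD_MATRIX = {
    "2.3": ("10.1", "7.6"),  # (CUDA, cuDNN)
    "2.4": ("11.0", "8.0"),
    "2.5": ("11.2", "8.1"),
}

def required_cuda_cudnn(tf_version):
    """Look up the (CUDA, cuDNN) pair for a TF version string like '2.4.1'."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return TF_BUILD_MATRIX.get(major_minor)

print(required_cuda_cudnn("2.4.1"))  # ('11.0', '8.0')
```

By that table, tensorflow-gpu 2.4.1 expects CUDA 11.0 / cuDNN 8.0, so a runtime that reports loading libcudart.so.10.1 and libcudnn.so.7 points at a mismatched toolkit.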

Everything runs smoothly until I get to the line:

model = N2V(config, model_name, basedir=basedir)

which takes about 5 minutes to execute, and I get the following output:

/home/sam/miniconda3/envs/n2v/lib/python3.7/site-packages/n2v/models/n2v_standard.py:416: UserWarning: output path for model already exists, files may be overwritten: /home/sam/models/BSD68_reproducability_5x5
'output path for model already exists, files may be overwritten: %s' % str(self.logdir.resolve()))
2022-03-04 10:33:12.171788: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-03-04 10:33:12.172307: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-03-04 10:33:12.205486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.205618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: NVIDIA RTX A5000 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 64 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 715.34GiB/s
2022-03-04 10:33:12.205630: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:33:12.206557: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-03-04 10:33:12.206579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-03-04 10:33:12.207523: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-04 10:33:12.207667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-04 10:33:12.208436: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-03-04 10:33:12.208839: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-03-04 10:33:12.210569: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-03-04 10:33:12.210663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.210860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.210941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-03-04 10:33:12.211272: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-04 10:33:12.212089: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: NVIDIA RTX A5000 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 64 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 715.34GiB/s
2022-03-04 10:33:12.212194: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:33:12.212205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-03-04 10:33:12.212212: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-03-04 10:33:12.212218: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-04 10:33:12.212224: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-04 10:33:12.212230: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-03-04 10:33:12.212236: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-03-04 10:33:12.212243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-03-04 10:33:12.212276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212384: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-03-04 10:33:12.212481: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:38:35.925719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-04 10:38:35.925741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2022-03-04 10:38:35.925746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2022-03-04 10:38:35.925939: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926283: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-03-04 10:38:35.926306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21899 MB memory) -> physical GPU (device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:61:00.0, compute capability: 8.6)
2022-03-04 10:38:35.926510: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
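As an aside, the CUDA library versions TensorFlow actually picked up can be pulled out of a log dump like the one above with a short snippet; a minimal sketch, assuming this "Successfully opened dynamic library" log format (the sample lines are abbreviated from the output above):

```python
import re

# Minimal sketch (assumed log format): extract the CUDA library versions
# that TensorFlow reports loading, to spot toolkit mismatches at a glance.
log = """\
Successfully opened dynamic library libcudart.so.10.1
Successfully opened dynamic library libcudnn.so.7
"""

# Each match is a (library, version) pair taken from the shared-object name.
versions = dict(re.findall(r"dynamic library (lib\w+)\.so\.([\d.]+)", log))
print(versions)  # {'libcudart': '10.1', 'libcudnn': '7'}
```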

If I then continue with:

history = model.train(X, X_val)

I get the following output, after which it just stops:

/home/sam/miniconda3/envs/n2v/lib/python3.7/site-packages/n2v/models/n2v_standard.py:194: UserWarning: small number of validation images (only 0.1% of all images)
warnings.warn("small number of validation images (only %.1f%% of all images)" % (100 * frac_val))

8 blind-spots will be generated per training patch of size (64, 64).

Preparing validation data: 100%|██████████████████████████████████████| 4/4 [00:00<00:00, 533.12it/s]

Epoch 1/200

2022-03-04 10:40:59.811781: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-03-04 10:40:59.830334: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3892985000 Hz
2022-03-04 10:41:00.455851: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7

Any thoughts?

Thanks!
Sam


snoreis commented Mar 5, 2022

OK, so I think I've fixed some of it. I changed the N2V installation procedure to:

conda create --name n2v python=3.7
conda activate n2v
conda install -c anaconda cudatoolkit
conda install cudatoolkit=11.0.*
conda install cudnn=8.0.*
conda install jupyter
pip install tensorflow-gpu==2.4.1
pip install n2v

The key difference is that I used tensorflow-gpu instead of plain tensorflow, which makes sense. Maybe that could be added to the documentation?

However, the string of warning messages after model = N2V(config, model_name, basedir=basedir) still appears, though the line now executes quickly.

Then, when I run the training, I get a repeating message:

2022-03-04 19:57:19.446524: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

which eventually finishes, but the messages cause the training to move very slowly. I know that because if I run a low number of epochs and steps, then execute the training cell again once it finishes, the error messages go away and everything goes according to plan.

Hope this helps anyone else in need! And if anyone has an idea what to do with the "Your CUDA software stack is old." message, that would be a great help!

Sam
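For later readers: the "ptxas fatal : Value 'sm_86' is not defined" error usually means the installed CUDA toolkit's ptxas predates the GPU's architecture — sm_86 support was added in CUDA 11.1, so a cudatoolkit=11.0 pin is one version short for an RTX A5000. A minimal lookup sketch, with an assumed, abbreviated table (verify against the CUDA release notes):

```python
# Rough sketch (assumed table, not exhaustive): minimum CUDA toolkit whose
# bundled ptxas understands a given NVIDIA compute capability.
MIN_CUDA_FOR_SM = {
    "sm_70": "9.0",   # Volta
    "sm_75": "10.0",  # Turing
    "sm_80": "11.0",  # Ampere (A100)
    "sm_86": "11.1",  # Ampere (RTX 30xx / RTX A-series)
}

def min_cuda_for(sm):
    """Return the minimum CUDA toolkit version for a compute capability."""
    return MIN_CUDA_FOR_SM.get(sm, "unknown")

print(min_cuda_for("sm_86"))  # 11.1
```

One workaround people report is moving the environment to CUDA 11.1 or newer so ptxas knows sm_86, but that interacts with the TensorFlow build matrix, so treat it as something to test rather than a guaranteed fix.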
