
GPU pre-built install on windows crashes on training with luz #1287

Open
Bernie-K opened this issue Mar 1, 2025 · 0 comments
Bernie-K commented Mar 1, 2025

I installed a torch GPU pre-built binary via the script on Windows 10, with kind adjusted to CUDA 12.4. It seems this is the only CUDA pre-built currently supported (see #1272).
I need to install a pre-built binary because the machine has an older CUDA / TensorFlow / Keras installation.

options(timeout = 600) # increasing the timeout is recommended since we will be downloading a ~2.5 GB file.
# For Windows and Linux: "cpu" and "cu124" are currently supported
# For macOS the supported kinds are: "cpu-intel" or "cpu-m1"
kind <- "cu124"
version <- available.packages()["torch","Version"]
options(repos = c(
  torch = sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version),
  CRAN = "https://cloud.r-project.org" # or any other mirror from which to install the remaining R dependencies.
))
install.packages("torch")

This downloads a ~2.5 GB zip https://torch-cdn.mlverse.org/packages/cu124/0.14.2/bin/windows/contrib/4.4/torch_0.14.2.zip
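For reference, the sprintf() call in the script expands kind and version into exactly that CDN repository path. A quick check of the expansion, with the version pinned to 0.14.2 (the version from the download link above) purely for illustration:

```r
kind <- "cu124"
version <- "0.14.2"  # the version resolved by available.packages() at the time
url <- sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version)
url
# → "https://torch-cdn.mlverse.org/packages/cu124/0.14.2/"
```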

sessionInfo()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] torch_0.14.2

loaded via a namespace (and not attached):
 [1] processx_3.8.5    bit_4.5.0.1       compiler_4.4.2    R6_2.6.1          magrittr_2.0.3    cli_3.6.4        
 [7] tools_4.4.2       rstudioapi_0.17.1 Rcpp_1.0.14       bit64_4.6.0-1     coro_1.1.0        callr_3.7.6      
[13] ps_1.8.1          rlang_1.1.5      

cuda_is_available() returns TRUE. I can also create torch tensors with device = 'cuda' and perform operations on them, e.g. matrix multiplication.

However, any model training with luz crashes. This looks like #1275.

The example given there crashes, as does any other training with luz (e.g. a convolutional net on the MNIST dataset):

library(luz)
library(torch)

ds <- tensor_dataset(torch_rand(10, 118, 8), torch_rand(10))

res_lstm <- nn_module(
  initialize = function(num_lags = 118){
    self$preprocess <- function(x){
      device <- x$device
      processed_vector <- torch_zeros(c(dim(x)[1],18,8), device = device)
      processed_vector[,1:8,] <- x[,1:8,]
      start_indices <- seq(9, 108, 11)
      
      for (i in 1:10) {
        start_idx <- start_indices[i]
        window <- x[, start_idx:(start_idx + 10),]
        processed_vector[, i + 8, ] <- torch_mean(window, dim = 2)
      }
      
      return(processed_vector)
    }
    
    self$num_lags <- num_lags
    
    self$res <- nn_sequential(
      nn_flatten(),
      nn_dropout(0.2),
      nn_linear(144,184),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(184,46),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1)
    )
    
    self$lstm <- nn_lstm(8,46,batch_first = TRUE)
    self$lstm_connection <- nn_sequential(
      nn_sigmoid(),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1))
  },
  
  forward = function(x){
    res <- self$res(self$preprocess(x))
    lstm <- self$lstm_connection(
      self$lstm(torch_flip(x,2))[[1]][,self$num_lags,]
    )
    torch_squeeze(nn_sigmoid()(res + lstm))
  }
)

fitted <- res_lstm %>% 
  setup(
    loss = nn_mse_loss(), 
    optimizer = optim_adam
  ) %>% 
  fit(ds, epochs = 10, dataloader_options = list(batch_size = 5))
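As an aside on the module above: the preprocess step condenses the 118 input lags into 18 channels, passing the first 8 lags through unchanged and averaging the remaining 110 in ten 11-wide windows, which is also why the first nn_linear layer expects 18 * 8 = 144 flattened features. A base-R sanity check of that arithmetic:

```r
# Lags 9..118 are averaged in ten consecutive windows of width 11
# (9:19, 20:30, ..., 108:118), matching seq(9, 108, 11) in preprocess.
start_indices <- seq(9, 108, by = 11)
stopifnot(length(start_indices) == 10)        # ten windows
windows <- lapply(start_indices, function(s) s:(s + 10))
stopifnot(all(lengths(windows) == 11))        # each window is 11 lags wide
stopifnot(max(windows[[10]]) == 118)          # last window ends at lag 118
# 8 passthrough lags + 10 window means = 18 rows; flattened: 18 * 8 = 144
stopifnot((8 + length(windows)) * 8 == 144)
```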

The MWE without luz that @jarroyoe came up with runs:

library(torch)

x <- torch_rand(10,118,8)
y <- torch_rand(10)

res_lstm <- nn_module(
    initialize = function(){
        self$lstm <- nn_lstm(8,46,batch_first = TRUE)
        self$lstm_connection <- nn_sequential(
            nn_sigmoid(),
            nn_linear(46,23),
            nn_sigmoid(),
            nn_linear(23,1))
    },
    
    forward = function(x){
        lstm <- self$lstm_connection(
            self$lstm(torch_flip(x,2))[[1]][,118,]
        )
        torch_squeeze(nn_sigmoid()(lstm))
    }
)

model <- res_lstm()
optimizer <- optim_adam(params = model$parameters)

for(epoch in 1:100){
	optimizer$zero_grad()
	y_pred <- model(x)
	loss <- torch_mean((y_pred - y)^2)
	cat("Epoch: ", epoch, "   Loss: ", loss$item(), "\n")
	loss$backward()
	optimizer$step()
}

I noticed that this runs on CPU only; any attempt to move the model and data to the GPU ends in a crash.
With a pre-built CPU torch binary installed on the same machine, the examples run as expected.
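To show roughly what "moving to GPU" means here, a minimal sketch using torch's standard device API, with a small stand-in module instead of the full MWE and a CPU fallback only so the snippet runs anywhere (on my cu124 install, any variant of this crashes at the forward pass):

```r
library(torch)

# Stand-in for the MWE's module: a small LSTM followed by a linear layer.
net <- nn_module(
  initialize = function() {
    self$lstm <- nn_lstm(8, 46, batch_first = TRUE)
    self$out  <- nn_linear(46, 1)
  },
  forward = function(x) {
    # take the LSTM output at the last time step, then project to a scalar
    torch_squeeze(self$out(self$lstm(x)[[1]][, dim(x)[2], ]))
  }
)

# Pick the GPU when available; fall back to CPU so the sketch runs anywhere.
device <- if (cuda_is_available()) torch_device("cuda") else torch_device("cpu")

x <- torch_rand(10, 118, 8, device = device)
model <- net()
model$to(device = device)  # parameters must live on the same device as the data

y_pred <- model(x)         # with the cu124 build, this is where it crashes for me
```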
