
GPU pre-built install on windows crashes on training with luz #1287

Open
Bernie-K opened this issue Mar 1, 2025 · 0 comments
Bernie-K commented Mar 1, 2025

I installed a torch GPU pre-built binary via the script on Windows 10, with kind adjusted to CUDA 12.4. It seems this is the only CUDA pre-built currently supported (see #1272).
I need to install a pre-built binary because the machine has an older CUDA / TensorFlow / Keras installation.

options(timeout = 600) # increasing the timeout is recommended since we will be downloading a ~2.5 GB file.
# For Windows and Linux: "cpu" and "cu124" are currently supported
# For macOS the supported kinds are: "cpu-intel" or "cpu-m1"
kind <- "cu124"
version <- available.packages()["torch","Version"]
options(repos = c(
  torch = sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version),
  CRAN = "https://cloud.r-project.org" # or any other mirror from which to install the remaining R dependencies.
))
install.packages("torch")

This downloads a ~2.5 GB zip https://torch-cdn.mlverse.org/packages/cu124/0.14.2/bin/windows/contrib/4.4/torch_0.14.2.zip
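For reference, the sprintf() call in the script expands kind and version into exactly that CDN repository path. A quick check of the expansion, with the version pinned to 0.14.2 (the version from the download link above) purely for illustration:

```r
kind <- "cu124"
version <- "0.14.2"  # the version resolved by available.packages() at the time
url <- sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version)
url
# → "https://torch-cdn.mlverse.org/packages/cu124/0.14.2/"
```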

sessionInfo()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] torch_0.14.2

loaded via a namespace (and not attached):
 [1] processx_3.8.5    bit_4.5.0.1       compiler_4.4.2    R6_2.6.1          magrittr_2.0.3    cli_3.6.4        
 [7] tools_4.4.2       rstudioapi_0.17.1 Rcpp_1.0.14       bit64_4.6.0-1     coro_1.1.0        callr_3.7.6      
[13] ps_1.8.1          rlang_1.1.5      

cuda_is_available() returns TRUE. I can also create torch tensors with device = 'cuda' and perform operations on them, e.g. matrix multiplication.

However, any model training with luz crashes. This looks like #1275.

The example given there crashes, as does any other training with luz (e.g. a convolutional net on the MNIST dataset):

library(luz)
library(torch)

ds <- tensor_dataset(torch_rand(10, 118, 8), torch_rand(10))

res_lstm <- nn_module(
  initialize = function(num_lags = 118){
    self$preprocess <- function(x){
      device <- x$device
      processed_vector <- torch_zeros(c(dim(x)[1],18,8), device = device)
      processed_vector[,1:8,] <- x[,1:8,]
      start_indices <- seq(9, 108, 11)
      
      for (i in 1:10) {
        start_idx <- start_indices[i]
        window <- x[, start_idx:(start_idx + 10),]
        processed_vector[, i + 8, ] <- torch_mean(window, dim = 2)
      }
      
      return(processed_vector)
    }
    
    self$num_lags <- num_lags
    
    self$res <- nn_sequential(
      nn_flatten(),
      nn_dropout(0.2),
      nn_linear(144,184),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(184,46),
      nn_sigmoid(),
      nn_dropout(0.2),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1)
    )
    
    self$lstm <- nn_lstm(8,46,batch_first = TRUE)
    self$lstm_connection <- nn_sequential(
      nn_sigmoid(),
      nn_linear(46,23),
      nn_sigmoid(),
      nn_linear(23,1))
  },
  
  forward = function(x){
    res <- self$res(self$preprocess(x))
    lstm <- self$lstm_connection(
      self$lstm(torch_flip(x,2))[[1]][,self$num_lags,]
    )
    torch_squeeze(nn_sigmoid()(res + lstm))
  }
)

fitted <- res_lstm %>% 
  setup(
    loss = nn_mse_loss(), 
    optimizer = optim_adam
  ) %>% 
  fit(ds, epochs = 10, dataloader_options = list(batch_size = 5))
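As an aside on the module above: the preprocess step condenses the 118 input lags into 18 channels, passing the first 8 lags through unchanged and averaging the remaining 110 in ten 11-wide windows, which is also why the first nn_linear layer expects 18 * 8 = 144 flattened features. A base-R sanity check of that arithmetic:

```r
# Lags 9..118 are averaged in ten consecutive windows of width 11
# (9:19, 20:30, ..., 108:118), matching seq(9, 108, 11) in preprocess.
start_indices <- seq(9, 108, by = 11)
stopifnot(length(start_indices) == 10)        # ten windows
windows <- lapply(start_indices, function(s) s:(s + 10))
stopifnot(all(lengths(windows) == 11))        # each window is 11 lags wide
stopifnot(max(windows[[10]]) == 118)          # last window ends at lag 118
# 8 passthrough lags + 10 window means = 18 rows; flattened: 18 * 8 = 144
stopifnot((8 + length(windows)) * 8 == 144)
```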

The MWE without luz that @jarroyoe came up with runs:

library(torch)

x <- torch_rand(10,118,8)
y <- torch_rand(10)

res_lstm <- nn_module(
    initialize = function(){
        self$lstm <- nn_lstm(8,46,batch_first = TRUE)
        self$lstm_connection <- nn_sequential(
            nn_sigmoid(),
            nn_linear(46,23),
            nn_sigmoid(),
            nn_linear(23,1))
    },
    
    forward = function(x){
        lstm <- self$lstm_connection(
            self$lstm(torch_flip(x,2))[[1]][,118,]
        )
        torch_squeeze(nn_sigmoid()(lstm))
    }
)

model <- res_lstm()
optimizer <- optim_adam(params = model$parameters)

for(epoch in 1:100){
	optimizer$zero_grad()
	y_pred <- model(x)
	loss <- torch_mean((y_pred - y)^2)
	cat("Epoch: ", epoch, "   Loss: ", loss$item(), "\n")
	loss$backward()
	optimizer$step()
}

I noticed that this runs on CPU only; any attempt to move the model and data to the GPU ends in a crash.
With a pre-built CPU torch binary installed on the same machine, the examples run as expected.
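To show roughly what "moving to GPU" means here, a minimal sketch using torch's standard device API, with a small stand-in module instead of the full MWE and a CPU fallback only so the snippet runs anywhere (on my cu124 install, any variant of this crashes at the forward pass):

```r
library(torch)

# Stand-in for the MWE's module: a small LSTM followed by a linear layer.
net <- nn_module(
  initialize = function() {
    self$lstm <- nn_lstm(8, 46, batch_first = TRUE)
    self$out  <- nn_linear(46, 1)
  },
  forward = function(x) {
    # take the LSTM output at the last time step, then project to a scalar
    torch_squeeze(self$out(self$lstm(x)[[1]][, dim(x)[2], ]))
  }
)

# Pick the GPU when available; fall back to CPU so the sketch runs anywhere.
device <- if (cuda_is_available()) torch_device("cuda") else torch_device("cpu")

x <- torch_rand(10, 118, 8, device = device)
model <- net()
model$to(device = device)  # parameters must live on the same device as the data

y_pred <- model(x)         # with the cu124 build, this is where it crashes for me
```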
