predict.kohonen is failing if the matrix is too big #13

Open

wcornwell opened this issue Sep 4, 2024 · 14 comments

@wcornwell (Collaborator)

From @jack-bilby:

12: aperm.default(X, c(s.call, s.ans))
11: aperm(X, c(s.call, s.ans))
10: apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction)
9: which(apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction))
8: FUN(X[[i]], ...)
7: lapply(data, function(x) which(apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction)))
6: check.data.na(newdata, maxNA.fraction = maxNA.fraction)
5: map.kohonen(object, newdata = newdata, whatmap = whatmap.new, ...)
4: map(object, newdata = newdata, whatmap = whatmap.new, ...)
3: predict.kohonen(MSOM, newdata = dd, whatmap = 1)
2: predict(MSOM, newdata = dd, whatmap = 1) at #10
1: classify_and_plot(1/60)
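
The failing call is the per-row NA check inside check.data.na: apply() transposes the whole matrix via aperm(), which blows up on very large inputs. A minimal sketch of a memory-friendlier equivalent of frames 9-10, assuming x is the data matrix and maxNA.fraction the threshold from the call above:

# Same per-row NA-fraction test as the apply() call in the traceback,
# computed with rowMeans() on a logical matrix so no aperm() transpose
# is needed. `x` and `maxNA.fraction` stand in for the objects inside
# check.data.na.
bad_rows <- which(rowMeans(is.na(x)) > maxNA.fraction)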

@wcornwell (Collaborator, Author)

Possible solution from @dfalster:

library(dplyr)
library(purrr)

# Predict in chunks of 10 rows rather than on the whole matrix at once.
data |>
  mutate(chunk = ceiling(row_number() / 10)) |>
  split(~chunk) |>
  purrr::map(~predict(MSOM, newdata = .x, whatmap = 1)) |>
  purrr::list_rbind()
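
One caveat with the sketch above: predict.kohonen expects matrix input and returns a list (later comments index prediction$predictions$activity), so each chunk probably needs as.matrix() and the relevant element pulled out before anything can be row-bound. A base-R version of the same chunking idea, under those assumptions:

# Hypothetical base-R equivalent: split row indices into blocks of 10,
# predict per block, and collect the activity predictions.
# Assumes `MSOM` is the trained kohonen object and `data` is numeric.
idx_chunks <- split(seq_len(nrow(data)), ceiling(seq_len(nrow(data)) / 10))
preds <- lapply(idx_chunks, function(i) {
  p <- predict(MSOM, newdata = as.matrix(data[i, , drop = FALSE]), whatmap = 1)
  p$predictions$activity
})
activity <- unlist(preds, use.names = FALSE)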

@wcornwell (Collaborator, Author)

Possible solution here: 0ca8092

needs testing...

@wcornwell (Collaborator, Author) commented Sep 4, 2024

Worked for me, @jack-bilby. Took a little over an hour for the file you sent me.

[Screenshot 2024-09-05 at 8:56:17 AM]

@dfalster (Member) commented Sep 4, 2024

Just noting that the small MSOM object Will has used here won't give very meaningful results, and will likely be faster than the full object would be.

If it's that slow, this could be an argument to parallelise this step?
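
If it came to that, a minimal sketch of parallelising the chunked predictions with the base parallel package, assuming the chunks really are independent (the tests further down call that into question):

library(parallel)

# Run each chunk's prediction on a separate core with mclapply()
# (fork-based, so single-core on Windows). Chunking as in the snippet
# above; chunk size and core count are illustrative guesses.
idx_chunks <- split(seq_len(nrow(data)), ceiling(seq_len(nrow(data)) / 1000))
preds <- mclapply(idx_chunks, function(i) {
  predict(MSOM, newdata = as.matrix(data[i, , drop = FALSE]), whatmap = 1)
}, mc.cores = 4)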

@jack-bilby (Collaborator)

Will, is that output from the solution you suggested, or from the original code?

I think it would be a good idea to at least have an option to chunk/parallelise the process, especially for larger datasets.

@wcornwell (Collaborator, Author)

Working on it

@wcornwell (Collaborator, Author) commented Sep 5, 2024

Looks like self-organizing map predictions are not row-by-row; they seem to use some kind of complex window-type context. So when I chunk the input file, I get edge effects at the chunk boundaries.

# Function to compare full data prediction vs chunk predictions
test_predict_kohonen_behavior <- function(dat, MSOM, chunk_size = 100) {
  # Step 1: Full dataset prediction
  full_data_matrix <- as.matrix(dat[, -1])
  full_prediction <- kohonen:::predict.kohonen(MSOM, newdata = full_data_matrix, whatmap = 1)
  full_activity <- full_prediction$predictions$activity
  
  # Step 2: Large chunk from the middle of the dataset
  middle_start <- nrow(dat) %/% 2 - chunk_size %/% 2
  middle_chunk <- dat[seq(from = middle_start, length.out = chunk_size), -1]
  middle_chunk_matrix <- as.matrix(middle_chunk)
  middle_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = middle_chunk_matrix, whatmap = 1)
  middle_chunk_activity <- middle_chunk_prediction$predictions$activity
  
  # Step 3: Edge case - Small chunk from the beginning of the dataset
  first_chunk <- dat[1:chunk_size, -1]
  first_chunk_matrix <- as.matrix(first_chunk)
  first_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = first_chunk_matrix, whatmap = 1)
  first_chunk_activity <- first_chunk_prediction$predictions$activity
  
  # Step 4: Edge case - Small chunk from the end of the dataset
  last_chunk <- dat[(nrow(dat) - chunk_size + 1):nrow(dat), -1]
  last_chunk_matrix <- as.matrix(last_chunk)
  last_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = last_chunk_matrix, whatmap = 1)
  last_chunk_activity <- last_chunk_prediction$predictions$activity
  
  # Step 5: Comparison of full dataset predictions with chunk predictions
  
  # Middle chunk comparison
  middle_full_activity <- full_activity[middle_start:(middle_start + chunk_size - 1)]
  middle_comparison <- middle_chunk_activity == middle_full_activity
  middle_na <- is.na(middle_chunk_activity) | is.na(middle_full_activity)
  
  cat("Middle chunk comparison:\n")
  if (all(middle_comparison[!middle_na])) {
    cat("Middle chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in middle chunk predictions at indices: ", which(!middle_comparison[!middle_na]), "\n")
  }
  
  # First chunk comparison
  first_full_activity <- full_activity[1:chunk_size]
  first_comparison <- first_chunk_activity == first_full_activity
  first_na <- is.na(first_chunk_activity) | is.na(first_full_activity)
  
  cat("First chunk comparison:\n")
  if (all(first_comparison[!first_na])) {
    cat("First chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in first chunk predictions at indices: ", which(!first_comparison[!first_na]), "\n")
  }
  
  # Last chunk comparison
  last_full_activity <- full_activity[(nrow(dat) - chunk_size + 1):nrow(dat)]
  last_comparison <- last_chunk_activity == last_full_activity
  last_na <- is.na(last_chunk_activity) | is.na(last_full_activity)
  
  cat("Last chunk comparison:\n")
  if (all(last_comparison[!last_na])) {
    cat("Last chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in last chunk predictions at indices: ", which(!last_comparison[!last_na]), "\n")
  }
  
  return(list(
    full_activity = full_activity,
    middle_chunk_activity = middle_chunk_activity,
    first_chunk_activity = first_chunk_activity,
    last_chunk_activity = last_chunk_activity,
    middle_comparison = middle_comparison,
    first_comparison = first_comparison,
    last_comparison = last_comparison
  ))
}

test_results <- test_predict_kohonen_behavior(dat, MSOM, chunk_size = 1000)

@wcornwell (Collaborator, Author) commented Sep 5, 2024

Not sure what to do about that behavior. @jack-bilby @dfalster?

https://en.wikipedia.org/wiki/Self-organizing_map

@wcornwell (Collaborator, Author)

[Screenshot 2024-09-05 at 1:03:39 PM]

@dfalster (Member) commented Sep 5, 2024

Darn. Thanks for investigating. In that case, I can see these options:

  1. Get rid of the parallelisation, just run on a machine with more memory
  2. Create chunks with overlaps to reduce edge effects, use the middle section when reconstructing (risky, untested; see the sketch below)
  3. Rewrite the whole predictive step in C++ (time consuming).

I reckon option 1 is the way forward.
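
For the record, a rough sketch of what option 2 could look like: pad each chunk with overlapping rows, predict, and keep only the interior when stitching results back together. The overlap width is a guess at whatever window drives the edge effects, so this is untested and illustrative only:

# Hypothetical overlap-chunked prediction (option 2 above). `overlap`
# must be at least as wide as the window causing the edge effects.
predict_overlapping <- function(object, data, chunk_size = 1000, overlap = 50) {
  n <- nrow(data)
  starts <- seq(1, n, by = chunk_size)
  out <- vector("list", length(starts))
  for (k in seq_along(starts)) {
    t_hi <- min(n, starts[k] + chunk_size - 1)         # last target row
    lo <- max(1, starts[k] - overlap)                  # padded chunk start
    hi <- min(n, t_hi + overlap)                       # padded chunk end
    p <- predict(object, newdata = as.matrix(data[lo:hi, , drop = FALSE]),
                 whatmap = 1)
    keep <- (starts[k] - lo + 1):(t_hi - lo + 1)       # interior positions
    out[[k]] <- p$predictions$activity[keep]           # drop the padded edges
  }
  unlist(out, use.names = FALSE)
}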

@wcornwell (Collaborator, Author)

Yup.

Aside from the computational annoyance, @jack-bilby, I think we can write the methods in a more informed way now.

@dfalster (Member) commented Sep 5, 2024

BTW - nice work implementing the parallelisation and then testing for consistency. That was wise to check. Shame it wasn't so easily parallelisable.

@wcornwell (Collaborator, Author)

I think this "neighborhood" prediction behavior might be by design.

Surprisingly, it's not like a random forest at all; it's more like a CNN.

@wcornwell (Collaborator, Author)

Interesting for multiple projects, from ChatGPT:

Row-wise Prediction (Independent instances)

These methods predict based on individual rows, treating each instance as a separate feature vector without explicit context from neighboring rows:

  1. Support Vector Machines (SVM): Classifies each row independently based on the feature space.
  2. Random Forest (RF): Each row is classified independently using decision trees.
  3. Gradient Boosting Machines (XGBoost, LightGBM): Like Random Forests, they classify each row independently after learning based on features.
  4. K-Nearest Neighbors (KNN) (unless combined with DTW, see below): Each row is treated independently, but the prediction is based on the closest rows (neighbors) in the feature space, so it’s somewhat neighborhood-dependent but not sequential.

Neighborhood Prediction (Uses neighboring or temporal context)

These methods take the surrounding or neighboring data points into account when making predictions, making them better suited for sequential and time series data:

  1. Recurrent Neural Networks (RNNs) (including LSTMs and GRUs): Predict by maintaining hidden states that store information from previous time steps, taking neighboring data points into account.

  2. Convolutional Neural Networks (CNNs): Can predict using a "receptive field" that captures information from a window or neighborhood of time steps, making it neighborhood-aware.

  3. Hidden Markov Models (HMMs): Each prediction depends on both the current observation and the hidden state, which is influenced by previous states, effectively using neighboring information.

  4. Dynamic Time Warping (DTW) + KNN: Measures similarity between entire time series or subsequences (neighborhood) instead of individual rows. DTW accounts for time shifts between sequences, and KNN uses the most similar neighbors.

  5. Autoencoders (especially temporal variants): Although often used for feature extraction, temporal autoencoders take neighboring time steps into account.

  6. Self-Organizing Maps (SOMs): Neurons in the map represent clusters of similar data points. The proximity between neurons reflects the similarity of the input data, so neighboring neurons influence predictions.

Summary

  • Neighborhood-based: RNNs, CNNs, HMMs, DTW + KNN, autoencoders, SOMs.
  • Row-wise: SVM, Random Forest, Gradient Boosting, standard KNN.

Neighborhood methods are generally better suited for time series because they naturally capture the temporal structure and dependencies in the data.
