predict.kohonen is failing if the matrix is too big #13

Open

wcornwell opened this issue Sep 4, 2024 · 14 comments

@wcornwell (Collaborator)

From @jack-bilby:

12: aperm.default(X, c(s.call, s.ans))
11: aperm(X, c(s.call, s.ans))
10: apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction)
9: which(apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction))
8: FUN(X[[i]], ...)
7: lapply(data, function(x) which(apply(x, 1, function(y) (sum(is.na(y))/length(y)) > maxNA.fraction)))
6: check.data.na(newdata, maxNA.fraction = maxNA.fraction)
5: map.kohonen(object, newdata = newdata, whatmap = whatmap.new, ...)
4: map(object, newdata = newdata, whatmap = whatmap.new, ...)
3: predict.kohonen(MSOM, newdata = dd, whatmap = 1)
2: predict(MSOM, newdata = dd, whatmap = 1) at #10
1: classify_and_plot(1/60)
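
The failing call is the per-row NA check inside check.data.na: apply() transposes the whole matrix via aperm(), which blows up on very large inputs. A minimal sketch of a memory-friendlier equivalent of frames 9-10, assuming x is the data matrix and maxNA.fraction the threshold from the call above:

# Same per-row NA-fraction test as the apply() call in the traceback,
# computed with rowMeans() on a logical matrix so no aperm() transpose
# is needed. `x` and `maxNA.fraction` stand in for the objects inside
# check.data.na.
bad_rows <- which(rowMeans(is.na(x)) > maxNA.fraction)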

@wcornwell (Collaborator, Author)

Possible solution from @dfalster:

library(dplyr)
library(purrr)

# Predict in chunks of 10 rows rather than on the whole matrix at once.
data |>
  mutate(chunk = ceiling(row_number() / 10)) |>
  split(~chunk) |>
  purrr::map(~predict(MSOM, newdata = .x, whatmap = 1)) |>
  purrr::list_rbind()
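
One caveat with the sketch above: predict.kohonen expects matrix input and returns a list (later comments index prediction$predictions$activity), so each chunk probably needs as.matrix() and the relevant element pulled out before anything can be row-bound. A base-R version of the same chunking idea, under those assumptions:

# Hypothetical base-R equivalent: split row indices into blocks of 10,
# predict per block, and collect the activity predictions.
# Assumes `MSOM` is the trained kohonen object and `data` is numeric.
idx_chunks <- split(seq_len(nrow(data)), ceiling(seq_len(nrow(data)) / 10))
preds <- lapply(idx_chunks, function(i) {
  p <- predict(MSOM, newdata = as.matrix(data[i, , drop = FALSE]), whatmap = 1)
  p$predictions$activity
})
activity <- unlist(preds, use.names = FALSE)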

@wcornwell (Collaborator, Author)

Possible solution here: 0ca8092

needs testing...

@wcornwell (Collaborator, Author) commented Sep 4, 2024

Worked for me, @jack-bilby. Took a little over an hour for the file you sent me.

[Screenshot 2024-09-05 at 8:56:17 AM]

@dfalster (Member) commented Sep 4, 2024

Just noting that the small MSOM object Will has used here won't give very meaningful results, and will likely be faster than the full object would be.

If it's that slow, this could be an argument to parallelise this step?
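
If it came to that, a minimal sketch of parallelising the chunked predictions with the base parallel package, assuming the chunks really are independent (the tests further down call that into question):

library(parallel)

# Run each chunk's prediction on a separate core with mclapply()
# (fork-based, so single-core on Windows). Chunking as in the snippet
# above; chunk size and core count are illustrative guesses.
idx_chunks <- split(seq_len(nrow(data)), ceiling(seq_len(nrow(data)) / 1000))
preds <- mclapply(idx_chunks, function(i) {
  predict(MSOM, newdata = as.matrix(data[i, , drop = FALSE]), whatmap = 1)
}, mc.cores = 4)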

@jack-bilby (Collaborator)

Will, is that output from the solution you suggested, or from the original code?

I think it would be a good idea to at least have an option to chunk/parallelise the process, especially for larger datasets.

@wcornwell (Collaborator, Author)

Working on it

@wcornwell (Collaborator, Author) commented Sep 5, 2024

Looks like self-organizing map predictions are not row-by-row; they seem to use some kind of complex window-type context. So when I chunk the input file, I get edge effects at the chunk boundaries.

# Function to compare full data prediction vs chunk predictions
test_predict_kohonen_behavior <- function(dat, MSOM, chunk_size = 100) {
  # Step 1: Full dataset prediction
  full_data_matrix <- as.matrix(dat[, -1])
  full_prediction <- kohonen:::predict.kohonen(MSOM, newdata = full_data_matrix, whatmap = 1)
  full_activity <- full_prediction$predictions$activity
  
  # Step 2: Large chunk from the middle of the dataset
  middle_start <- nrow(dat) %/% 2 - chunk_size %/% 2
  middle_chunk <- dat[seq(from = middle_start, length.out = chunk_size), -1]
  middle_chunk_matrix <- as.matrix(middle_chunk)
  middle_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = middle_chunk_matrix, whatmap = 1)
  middle_chunk_activity <- middle_chunk_prediction$predictions$activity
  
  # Step 3: Edge case - Small chunk from the beginning of the dataset
  first_chunk <- dat[1:chunk_size, -1]
  first_chunk_matrix <- as.matrix(first_chunk)
  first_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = first_chunk_matrix, whatmap = 1)
  first_chunk_activity <- first_chunk_prediction$predictions$activity
  
  # Step 4: Edge case - Small chunk from the end of the dataset
  last_chunk <- dat[(nrow(dat) - chunk_size + 1):nrow(dat), -1]
  last_chunk_matrix <- as.matrix(last_chunk)
  last_chunk_prediction <- kohonen:::predict.kohonen(MSOM, newdata = last_chunk_matrix, whatmap = 1)
  last_chunk_activity <- last_chunk_prediction$predictions$activity
  
  # Step 5: Comparison of full dataset predictions with chunk predictions
  
  # Middle chunk comparison
  middle_full_activity <- full_activity[middle_start:(middle_start + chunk_size - 1)]
  middle_comparison <- middle_chunk_activity == middle_full_activity
  middle_na <- is.na(middle_chunk_activity) | is.na(middle_full_activity)
  
  cat("Middle chunk comparison:\n")
  if (all(middle_comparison[!middle_na])) {
    cat("Middle chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in middle chunk predictions at indices: ", which(!middle_comparison[!middle_na]), "\n")
  }
  
  # First chunk comparison
  first_full_activity <- full_activity[1:chunk_size]
  first_comparison <- first_chunk_activity == first_full_activity
  first_na <- is.na(first_chunk_activity) | is.na(first_full_activity)
  
  cat("First chunk comparison:\n")
  if (all(first_comparison[!first_na])) {
    cat("First chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in first chunk predictions at indices: ", which(!first_comparison[!first_na]), "\n")
  }
  
  # Last chunk comparison
  last_full_activity <- full_activity[(nrow(dat) - chunk_size + 1):nrow(dat)]
  last_comparison <- last_chunk_activity == last_full_activity
  last_na <- is.na(last_chunk_activity) | is.na(last_full_activity)
  
  cat("Last chunk comparison:\n")
  if (all(last_comparison[!last_na])) {
    cat("Last chunk predictions match full data.\n")
  } else {
    cat("Discrepancies found in last chunk predictions at indices: ", which(!last_comparison[!last_na]), "\n")
  }
  
  return(list(
    full_activity = full_activity,
    middle_chunk_activity = middle_chunk_activity,
    first_chunk_activity = first_chunk_activity,
    last_chunk_activity = last_chunk_activity,
    middle_comparison = middle_comparison,
    first_comparison = first_comparison,
    last_comparison = last_comparison
  ))
}

test_results <- test_predict_kohonen_behavior(dat, MSOM, chunk_size = 1000)

@wcornwell (Collaborator, Author) commented Sep 5, 2024

Not sure what to do about that behavior. @jack-bilby @dfalster?

https://en.wikipedia.org/wiki/Self-organizing_map

@wcornwell (Collaborator, Author)

[Screenshot 2024-09-05 at 1:03:39 PM]

@dfalster (Member) commented Sep 5, 2024

Darn. Thanks for investigating. In that case, I can see these options:

  1. Get rid of the parallelisation, just run on a machine with more memory
  2. Create chunks with overlaps to reduce edge effects, use the middle section when reconstructing (risky, untested; see the sketch below)
  3. Rewrite the whole predictive step in C++ (time consuming).

I reckon option 1 is the way forward.
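
For the record, a rough sketch of what option 2 could look like: pad each chunk with overlapping rows, predict, and keep only the interior when stitching results back together. The overlap width is a guess at whatever window drives the edge effects, so this is untested and illustrative only:

# Hypothetical overlap-chunked prediction (option 2 above). `overlap`
# must be at least as wide as the window causing the edge effects.
predict_overlapping <- function(object, data, chunk_size = 1000, overlap = 50) {
  n <- nrow(data)
  starts <- seq(1, n, by = chunk_size)
  out <- vector("list", length(starts))
  for (k in seq_along(starts)) {
    t_hi <- min(n, starts[k] + chunk_size - 1)         # last target row
    lo <- max(1, starts[k] - overlap)                  # padded chunk start
    hi <- min(n, t_hi + overlap)                       # padded chunk end
    p <- predict(object, newdata = as.matrix(data[lo:hi, , drop = FALSE]),
                 whatmap = 1)
    keep <- (starts[k] - lo + 1):(t_hi - lo + 1)       # interior positions
    out[[k]] <- p$predictions$activity[keep]           # drop the padded edges
  }
  unlist(out, use.names = FALSE)
}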

@wcornwell (Collaborator, Author)

Yup.

Aside from the computational annoyance, @jack-bilby, I think we can write the methods in a more informed way now.

@dfalster (Member) commented Sep 5, 2024

BTW - nice work implementing the parallelisation and then testing for consistency. That was wise to check. Shame it wasn't so easily parallelisable.

@wcornwell (Collaborator, Author)

I think this "neighborhood" prediction behavior might be by design.

Surprisingly, it's not like a random forest at all; it's more like a CNN.

@wcornwell (Collaborator, Author)

Interesting for multiple projects, from ChatGPT:

Row-wise Prediction (Independent instances)

These methods predict based on individual rows, treating each instance as a separate feature vector without explicit context from neighboring rows:

  1. Support Vector Machines (SVM): Classifies each row independently based on the feature space.
  2. Random Forest (RF): Each row is classified independently using decision trees.
  3. Gradient Boosting Machines (XGBoost, LightGBM): Like Random Forests, they classify each row independently after learning based on features.
  4. K-Nearest Neighbors (KNN) (unless combined with DTW, see below): Each row is treated independently, but the prediction is based on the closest rows (neighbors) in the feature space, so it’s somewhat neighborhood-dependent but not sequential.

Neighborhood Prediction (Uses neighboring or temporal context)

These methods take the surrounding or neighboring data points into account when making predictions, making them better suited for sequential and time series data:

  1. Recurrent Neural Networks (RNNs) (including LSTMs and GRUs): Predict by maintaining hidden states that store information from previous time steps, taking neighboring data points into account.

  2. Convolutional Neural Networks (CNNs): Can predict using a "receptive field" that captures information from a window or neighborhood of time steps, making it neighborhood-aware.

  3. Hidden Markov Models (HMMs): Each prediction depends on both the current observation and the hidden state, which is influenced by previous states, effectively using neighboring information.

  4. Dynamic Time Warping (DTW) + KNN: Measures similarity between entire time series or subsequences (neighborhood) instead of individual rows. DTW accounts for time shifts between sequences, and KNN uses the most similar neighbors.

  5. Autoencoders (especially temporal variants): Although often used for feature extraction, temporal autoencoders take neighboring time steps into account.

  6. Self-Organizing Maps (SOMs): Neurons in the map represent clusters of similar data points. The proximity between neurons reflects the similarity of the input data, so neighboring neurons influence predictions.

Summary

  • Neighborhood-based: RNNs, CNNs, HMMs, DTW + KNN, autoencoders, SOMs.
  • Row-wise: SVM, Random Forest, Gradient Boosting, standard KNN.

Neighborhood methods are generally better suited for time series because they naturally capture the temporal structure and dependencies in the data.
