
Added parallelism to train.jl and improved efficiency of types.jl #8

Open · wants to merge 1 commit into master

Conversation

Skylion007

I added parallelism so that you can train with several processes. I felt conflicted about whether I should force types.jl to be included via @everywhere, but I left that up to you. I also used shared arrays to speed up execution, and I had to make some types immutable so that they are bits types. This lets them be stored in SharedArrays and should improve execution speed overall.
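
Roughly, the setup looks like this (a minimal sketch with illustrative names and sizes, in current Julia syntax; not the actual patch):

```julia
using Distributed, SharedArrays

addprocs(4)                         # start worker processes
@everywhere include("types.jl")     # make the model types available on every worker

# SharedArray elements must be "bits types" (immutable structs of plain data),
# which is why some of the types are made immutable.  For plain floats:
num_users, num_items, num_features = 1_000, 2_000, 25
user_features = SharedArray{Float64}(num_users, num_features)
item_features = SharedArray{Float64}(num_items, num_features)
```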

@aaw
Owner

aaw commented May 10, 2016

Hi @Skylion007, this is neat, thanks for looking into parallelizing the training. I pulled your patch down and ran a sample training loop with it, something like:

```julia
julia> import IncrementalSVD
julia> rating_set = IncrementalSVD.load_small_movielens_dataset();
...
julia> model = IncrementalSVD.train(rating_set, 25);
...
julia> IncrementalSVD.rmse(rating_set, model)
3.5707712042854016
```

That RMSE is pretty far from the ~0.92 that the existing code achieves with the same sequence of commands. Accordingly, if you run more of the sample commands in the README (like IncrementalSVD.similar_items(model, "Friday the 13th (1980)")), you'll get qualitatively bad results with this patch just like the high RMSE predicts.

Part of this may be an error in the patch: your loop uses residual.curr_error in a few places before that error has been updated for the current iteration. For example, error_diff = residual.prev_error - residual.curr_error is computed while residual.curr_error still holds the previous iteration's value, because it hasn't yet been set to residual.value - user_feature * item_feature.
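
For concreteness, the ordering I'd expect is roughly this (a sketch using the names from the comment above, not the actual train.jl loop):

```julia
struct Residual
    value::Float32
    prev_error::Float32
    curr_error::Float32
end

function update_error(residual::Residual, user_feature, item_feature)
    # Compute this iteration's error *before* anything else reads curr_error...
    curr_error = residual.value - user_feature * item_feature
    error_diff = residual.prev_error - curr_error
    # ...and only then write it back, so it becomes the previous error next pass.
    Residual(residual.value, curr_error, curr_error), error_diff
end
```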

It may also be that this loop just isn't amenable to naive parallel execution: each iteration reads and updates the item_features and user_features entries for the current rating, so it's easy to imagine two ratings trying to update the same entries at the same time and causing problems.

In either event, if you can update the patch to get the RMSE down to something reasonable, I'd be happy to take another look.

@Skylion007
Author

Ah, the way I parallelized it should have accounted for this and made sure that two ratings weren't being updated at the same time. The error probably has to do with the residual's curr_error. That's interesting, though, because I tested it on a really massive dataset and got a pretty consistent RMSE, albeit that was over 200 million ratings, so perhaps that's why the residual error was minuscule. I'll be glad to try to fix it and test it out when I have the chance.

@Skylion007
Author

On second thought, you may be right about the parallelism issue. Oddly enough, it worked on my dataset because it was so massive. With approximately 200,000 ratings, the odds of two items overwriting the same value are relatively small, but on these smaller datasets the error is much more pronounced. I may have to figure out how to rewrite this if I want to keep the parallelism. I found an implementation that works by relying on a map-reduce-style framework, but Julia's code just doesn't seem amenable to that. Perhaps it would work if Julia adds read/write locks or another way to make these updates safely concurrent.

Actually, one way that might work is having each process calculate its own portion of the matrix and then combining the results, perhaps with SIMD operations. Alternatively, group-by algorithms would make it relatively easy to divide the work, something like the sketch below. I'll look into this. I'll also leave this code as-is for now, since it works well on large datasets where the error is relatively small.
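
Something along these lines is what I have in mind (a rough sketch with illustrative names and sizes, in current Julia syntax; item rows are still shared between workers, so those updates would still need care):

```julia
using Distributed, SharedArrays
addprocs(4)
@everywhere using SharedArrays

# Each worker trains only the users assigned to it, so no two workers ever
# write the same row of user_features.
@everywhere function train_users!(user_features, item_features, my_ratings, rate=0.001)
    nf = size(user_features, 2)
    for (u, i, r) in my_ratings                    # (user, item, rating) triples
        err = r - sum(user_features[u, f] * item_features[i, f] for f in 1:nf)
        for f in 1:nf
            user_features[u, f] += rate * err * item_features[i, f]
            item_features[i, f] += rate * err * user_features[u, f]
        end
    end
end

num_users, num_items, num_features = 1_000, 2_000, 25
user_features = SharedArray{Float64}(num_users, num_features); fill!(user_features, 0.1)
item_features = SharedArray{Float64}(num_items, num_features); fill!(item_features, 0.1)
ratings = [(rand(1:num_users), rand(1:num_items), 5 * rand()) for _ in 1:100_000]

# "Group by" user id: one disjoint chunk of users per worker.
chunks = [filter(t -> t[1] % nworkers() + 1 == w, ratings) for w in 1:nworkers()]
pmap(chunk -> train_users!(user_features, item_features, chunk), chunks)
```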

@Skylion007
Author

So apparently Julia does have support for locking via mutexes. Since the model is relatively small compared to the number of ratings, it might be possible to keep a parallel array of mutexes to prevent collisions, roughly as in the sketch below. The overhead should be relatively minor even on large datasets.
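
For example, something like this could work (a rough sketch using Julia's thread-level locks rather than the multi-process SharedArray approach in the patch; names and sizes are illustrative, and user rows would need the same treatment):

```julia
num_users, num_items, num_features = 1_000, 2_000, 25
user_features = fill(0.1, num_users, num_features)
item_features = fill(0.1, num_items, num_features)
ratings = [(rand(1:num_users), rand(1:num_items), 5 * rand()) for _ in 1:100_000]

item_locks = [ReentrantLock() for _ in 1:num_items]    # one mutex per item row

# Run with multiple threads (julia -t N); the lock prevents two ratings for
# the same item from updating item_features at the same time.
Threads.@threads for idx in eachindex(ratings)
    u, i, r = ratings[idx]
    lock(item_locks[i])
    try
        err = r - sum(user_features[u, f] * item_features[i, f] for f in 1:num_features)
        for f in 1:num_features
            item_features[i, f] += 0.001 * err * user_features[u, f]
            user_features[u, f] += 0.001 * err * item_features[i, f]
        end
    finally
        unlock(item_locks[i])
    end
end
```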
