
ilof pietro #1317

Merged
merged 43 commits into from
Sep 11, 2023

Conversation

pietro-tanure
Contributor

…into ilof-pietro


@pietro-tanure
Contributor Author

pietro-tanure commented Aug 24, 2023

@hoanganhngo610 This is the same code as in the previous pull request, which I accidentally closed (#1232 (comment)) while trying to sync my branch with the current state of river. This branch is now up to date and compatible with the latest river commit.

@hoanganhngo610
Contributor

Thank you very much @pietro-tanure. I can see this is an editable PR, so wherever necessary I will make changes to the code directly, or leave comments for further discussion.
I will be able to dig deeper into this starting next week.

@hoanganhngo610 hoanganhngo610 self-requested a review August 25, 2023 04:44
@MaxHalford
Member

@hoanganhngo610 thanks for taking care of this. Is it going well?

I don't think we want to merge the notebook. What would be cool is to turn the notebook into a unit test, to show that the results are consistent with scikit-learn.

Regarding the name, I would call it LocalOutlierFactor, to be coherent with scikit-learn. Indeed, both implementations provide the same results. Also, I don't think many people can be expected to know what the acronym LOF corresponds to.

@hoanganhngo610
Contributor

hoanganhngo610 commented Sep 2, 2023

@MaxHalford Thank you so much for the comments!
First of all, it's going quite well IMO. What I'm actually doing is refactoring the code and modifying certain points in the algorithm to make sure that the output it generates aligns with our standards and with what we have done for previous algorithms. Apart from that, this is a really decent implementation of the online LOF.

Moreover, regarding the name, if you prefer to go with the full name of the LOF, i.e. LocalOutlierFactor, I would suggest going with IncrementalLocalOutlierFactor, to emphasize the fact that the algorithm implemented within River is actually the incremental version of the original LOF!

@MaxHalford
Member

Every algorithm in River is incremental, so I don't think there's a need to prepend the name with Incremental :)

@hoanganhngo610
Contributor

hoanganhngo610 commented Sep 3, 2023

Yep got it! In this case I will change the name to LocalOutlierFactor in future commits.

@hoanganhngo610
Contributor

@pietro-tanure @MaxHalford After going through the code, I am suggesting that we can have two functions, one for one instance and one for multiple instances, for the learning phase (learn_one and learn_many) and for the scoring phase (score_one and score_many). Do you think that is a good option?

@pietro-tanure
Contributor Author


@hoanganhngo610 Thanks a lot for what you have been doing. I think that's a good idea; the original code had learn_one and learn_many functions. It also had a score_one function with a window_score argument representing the size of the batch to score, so it could score many points at once as well, but a separate score_many function might be better for clarity.

@MaxHalford
Member

Hey there! Great work to both of you.

I'm aligned with learn_many. However, I would only add score_many if it's not a for loop over score_one.
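A minimal sketch of the distinction Max is drawing, using entirely hypothetical class and method names (this is not River's actual base class): the `*_many` methods below default to plain loops over their one-sample counterparts, so a subclass would only be worth overriding `learn_many`/`score_many` on when a genuinely batched implementation exists.

```python
# Hypothetical sketch: *_many methods that fall back to looping over
# the *_one methods. A concrete detector overrides them only when it
# has a truly vectorized version; otherwise the default loop suffices.

class BaseDetector:
    def learn_one(self, x: dict) -> None:
        raise NotImplementedError

    def score_one(self, x: dict) -> float:
        raise NotImplementedError

    def learn_many(self, X: list) -> None:
        # Default: plain loop; override for a real batched implementation.
        for x in X:
            self.learn_one(x)

    def score_many(self, X: list) -> list:
        # Default: plain loop over score_one.
        return [self.score_one(x) for x in X]


class MeanDistanceDetector(BaseDetector):
    """Toy detector: score = |value - running mean| of feature 'v'."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def learn_one(self, x: dict) -> None:
        self.n += 1
        self.total += x["v"]

    def score_one(self, x: dict) -> float:
        mean = self.total / self.n if self.n else 0.0
        return abs(x["v"] - mean)


model = MeanDistanceDetector()
model.learn_many([{"v": 1.0}, {"v": 3.0}])
print(model.score_one({"v": 2.0}))  # running mean is 2.0 -> score 0.0
```

Under this design, a `score_many` that is just the inherited loop adds no value over calling `score_one` repeatedly, which is the point being made above.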

@hoanganhngo610
Contributor

Thank you so much @pietro-tanure @MaxHalford for the comments! In that case, I will bring back learn_many so it works with data frames as input, while adding an if/else within score_one so it adapts to the input it is given (a dict for score_one, a pandas DataFrame for score_many). Does that sound like a good plan?
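The type-based dispatch described above could look something like the following sketch (hypothetical names; a list of dicts stands in for a DataFrame's rows here to avoid a pandas dependency):

```python
class ThresholdModel:
    """Toy scorer: the anomaly score is simply the feature value."""

    def score_one(self, x: dict) -> float:
        return float(x["v"])


def score(model, x):
    # Dispatch on input type: a single dict -> one score,
    # an iterable of dicts (a DataFrame stand-in) -> a list of scores.
    if isinstance(x, dict):
        return model.score_one(x)
    return [model.score_one(row) for row in x]


m = ThresholdModel()
print(score(m, {"v": 1.5}))                # 1.5
print(score(m, [{"v": 1.0}, {"v": 2.0}]))  # [1.0, 2.0]
```

With a real DataFrame, the batch branch would iterate over rows (e.g. converting each row to a dict) before scoring; the single-sample path stays unchanged.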

@hoanganhngo610
Contributor

@pietro-tanure @MaxHalford I think I have gone extensively through this PR. Taking into account the suggestion of Max, I have removed the notebook from the PR and replaced it with a test file. However, @pietro-tanure, please rest assured that the notebook is fascinating, and I am quite sure that I will be able to demonstrate it elsewhere (will update you on that later).
In the meantime, I think this PR is ready to be merged. The only remaining problem is the tests that keep failing, especially the ubuntu river build and the code quality checks. @MaxHalford, on my side the code quality tests all pass locally, but somehow one error remains in the tests running on GitHub, in the test_glm file.

@MaxHalford
Member

Thanks @hoanganhngo610! I can take over from here and take care of the tests. Congrats to both of you on the good work :)

@MaxHalford MaxHalford merged commit 11f6cf9 into online-ml:main Sep 11, 2023
6 of 8 checks passed
@pietro-tanure
Contributor Author

Thank you very much @hoanganhngo610 for all the work, I'm very happy with the result of the code. Thank you @MaxHalford as well for your readiness to help.

@MaxHalford
Member

@hoanganhngo610 @pietro-tanure FYI the tests were failing because LOF needs to be warm-started to function correctly. Indeed, if you run it with a simple progressive validation loop, it breaks:

from river import anomaly, utils, datasets

model = anomaly.LocalOutlierFactor()
dataset = datasets.Phishing()

for x, y in dataset:
    model.learn_one(x)
    model.score_one(x)

This isn't ideal, so for now I've disabled the tests. I'd love to see this fixed at some point.

@hoanganhngo610
Contributor

@MaxHalford I also noticed this when testing the algorithm locally, and that's also why I designed the test file so that it learns a certain amount of data points before actually doing the scoring. I know that this would not be ideal, but given the nature of the algorithm, I believe it is somewhat acceptable. However, we can try to fix this in the future.
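The warm-up pattern described above can be sketched generically. This is a toy model on a synthetic stream, not River's API: the detector refuses to score before it has seen any data, so a burn-in phase learns a fixed number of points before the usual score-then-learn loop begins.

```python
import itertools
import random


class RunningMeanDetector:
    """Toy detector: score = |x - running mean|; undefined before any data."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def learn_one(self, x: float) -> None:
        self.n += 1
        self.total += x

    def score_one(self, x: float) -> float:
        if self.n == 0:
            raise RuntimeError("model must be warm-started before scoring")
        return abs(x - self.total / self.n)


random.seed(42)
stream = (random.gauss(0.0, 1.0) for _ in range(200))
model = RunningMeanDetector()

WARMUP = 50
for x in itertools.islice(stream, WARMUP):  # burn-in: learn only, no scoring
    model.learn_one(x)

scores = []
for x in stream:                 # then progressive validation as usual
    scores.append(model.score_one(x))  # score before...
    model.learn_one(x)                 # ...learning on the same point
print(len(scores))  # 150
```

The same shape applies to LOF: pick a warm-up size, feed that many points through learn_one only, and only then start interleaving scoring with learning.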
