-
Hello 👋
Don't apologize! I'm grateful you took the time to try out the library and provide feedback. It's very useful.

Learn One / Predict One API
This can be the user's perception, but it isn't always true. Some models, like naive Bayes, have this property, but others, such as linear regression, don't. We encourage […]
There shouldn't be any reason to mix both training paradigms. Think of it a bit like scikit-learn: it has some support for mini-batch training, without it being the main paradigm. It's one or the other, but not both. I understand the ideal you're aiming for, but in practice a distinction between pure streaming and mini-batching has to be made.
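To make this concrete, here is a minimal sketch (plain NumPy, not River code, with made-up numbers) of why a single mini-batch gradient step differs from a sequence of single-sample steps for linear regression, whereas counting-based models like naive Bayes are unaffected by how the samples are split:

```python
import numpy as np

# Squared loss for a one-feature linear model: (w * x - y) ** 2
def grad(w, x, y):
    return 2 * (w * x - y) * x

X = np.array([1.0, 2.0])
Y = np.array([1.0, 3.0])
lr = 0.1

# Mini-batch: one step with the gradient averaged over the batch.
w_batch = 0.0
w_batch -= lr * np.mean([grad(w_batch, x, y) for x, y in zip(X, Y)])

# Streaming: one step per sample, each using the latest weights.
w_stream = 0.0
for x, y in zip(X, Y):
    w_stream -= lr * grad(w_stream, x, y)

print(w_batch, w_stream)  # 0.7 vs 1.24: the two paradigms diverge
```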
Custom Models / Preprocessing / Metrics

There is a […]
High Level Abstractions

There's a lot to unpack here. In terms of validation, there is […]. In terms of model selection, there is […]. To add on top of the previous paragraph, it has to be mentioned that online model selection is a tricky topic. Ideally, you want to monitor model performance online, not just find good parameters offline. That's the purpose of the […].

What would you say is missing in terms of tooling? What is giving you pain?
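For concreteness, the online monitoring mentioned above is usually a test-then-train loop (progressive validation). A minimal sketch, with the dataset and model chosen purely for illustration:

```python
from river import datasets, linear_model, metrics

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

# Test-then-train: score each prediction before learning from the label.
for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)
    metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)
```

River's `evaluate.progressive_val_score` wraps this loop in a single call.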
Data

Feel welcome to contribute connectors :). I'm very aware of those libraries, but I haven't used them yet.
Rust / Performance / Deployment

I'm glad you bring this up. We have an ongoing project called light-river, which will implement the "best" models in Rust, with a focus on performance and portability. I hope this helps.
Cheers.
-
Hi!

High Level Abstractions

Admittedly, I was not aware of the […]. For reference, I'll add what I was looking for in terms of tooling, in order of importance (tentative). Note that this list comes from my experience with sklearn rather than River. Hence, some points might not be relevant in the online learning case (apologies for this, but my feeling is that online ML requires a change of mindset that takes some time to settle, and I'm still trying to understand if it suits my use cases).
Learn One / Predict One vs Learn Many / Predict Many
Totally understand this. But, in practice, when running experiments, it's a bit uncomfortable having to differentiate between the two. What I mean is that I would like to have a general method, like:

```python
import pandas as pd

# Possible approach: `learn` and `predict_proba` are the proposed unified
# methods; `model`, `dataset.path` and `names` are assumed to be defined.
for x in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=300_000):
    y = x.pop('target')
    y_pred = model.predict_proba(x)
    model.learn(x, y)
```
```python
# If the model supports learn_many, the learn method above is equivalent
# to calling learn_many directly:
for x in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=300_000):
    y = x.pop('target')
    y_pred = model.predict_proba(x)  # do something with the prediction
    model.learn_many(x, y)
```
```python
# If the model does not support learn_many, the learn method above is
# internally doing something like this:
for batch in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=300_000):
    # Just an example, not necessarily the best approach: split the batch
    # into a list of single-sample dicts.
    for row in batch.to_dict('records'):
        y = row.pop('target')
        y_pred = model.predict_proba(row)
        # Here we could append the prediction to a batch container
        model.learn_one(row, y)
```
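To make the dispatch explicit, here is a hypothetical wrapper (all names are made up) that provides the unified methods on top of any River-style estimator:

```python
import pandas as pd

class AutoBatch:
    """Hypothetical wrapper giving a single learn/predict_proba entry point."""

    def __init__(self, model):
        self.model = model

    def learn(self, X: pd.DataFrame, y: pd.Series):
        if hasattr(self.model, "learn_many"):
            self.model.learn_many(X, y)  # mini-batch path
        else:
            for (_, row), label in zip(X.iterrows(), y):
                self.model.learn_one(row.to_dict(), label)  # row-by-row fallback
        return self

    def predict_proba(self, X: pd.DataFrame):
        if hasattr(self.model, "predict_proba_many"):
            return self.model.predict_proba_many(X)
        return [self.model.predict_proba_one(row.to_dict()) for _, row in X.iterrows()]
```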
The problem then is that the batch size used during training might not be the same as the batch size used in production, which typically equals one. Again, this is just an example, and not necessarily a good one. Another approach might be adding the possibility to define your own loop and, for instance, use it with the model selectors in the […].

Data

I'll try to come up with an example using Polars and Datasets, and share it when ready! TorchData might also be an option.

Rust

[…]
Again, thanks a lot for taking the time to read and reply (I'll try to make shorter posts in the future)!
-
Hi!
I apologize in advance for the lengthy post, but after playing with it a bit, I would like to share a few considerations about River.
Background
I mainly work with sensors in the manufacturing world, where streaming data processing is our daily bread. Currently we develop models mainly based on `sklearn` or custom versions of them (e.g. we follow the `fit`, `predict` API). Honestly, however, this choice has many conceptual (streaming vs batch) and practical (deployment) limitations. When I started searching for something different, I found River. I've been playing with it a little bit in my spare time. I really like it and I think that, in the future, I might convince myself to switch to it. Before doing that, however, I would like to share a few considerations.

Learn One / Predict One API
If I understood well, River was first created for learning and predicting one data point at a time. Batched learning and batched predictions were added later, and only for some estimators. However, I don't fully understand the need to differentiate the two APIs. From the user's perspective, the `learn_one`, `predict_one` API is just a special case of `learn_many` and `predict_many`. Of course, if you are using River it's very likely that in production you will receive one sample at a time. But when you are experimenting/training, you typically have datasets and batches of data. Having to differentiate between the two cases by literally calling different methods is a bit annoying in my opinion. The same applies to `predict_proba_one` and `predict_proba_many`.

Having said that, I am not an online learning expert, so there might also be a theoretical reason for making this distinction (e.g., maybe online learning simply is something different from classical machine learning, not a generalization of it; I just don't know enough). Moreover, in order to homogenize the two APIs, I guess there are probably a lot of subtleties that one must take into account, starting from the input type of the data (dicts vs dataframes/arrays).
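For reference, this is roughly what the split looks like today, assuming one of the estimators (e.g. River's linear models) that implements both APIs; the feature names and values are made up:

```python
import pandas as pd
from river import linear_model

model = linear_model.LogisticRegression()

# One sample at a time: dicts in, a single prediction out.
model.learn_one({"x1": 1.0, "x2": 0.5}, True)
p_one = model.predict_proba_one({"x1": 0.2, "x2": 0.1})

# Mini-batch: pandas structures in, batched predictions out.
X = pd.DataFrame({"x1": [1.0, 0.0], "x2": [0.5, 1.5]})
y = pd.Series([True, False])
model.learn_many(X, y)
p_many = model.predict_proba_many(X)
```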
To summarize, I would say that my dream would be to have a framework that seamlessly works with both batch and online learning. But, of course, that is really hard to do.
Custom Models / Preprocessing / Metrics
It would be nice to have a way for the user to implement custom models and preprocessors. I know that implementing online models is far from trivial, but sometimes the user might simply want to customize an existing one or add a simple step to a pipeline. This is really easy to do in `sklearn`, since you (almost) only need to implement `fit`, `predict` and `transform`, and it is something we do very often. This also applies to preprocessing and metrics (and maybe other classes?). Actually, I don't know if this is already possible, but I haven't found anything in the documentation.
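For what it's worth, here is a sketch of what such a custom step might look like, assuming River's `base.Transformer` contract (`learn_one`/`transform_one`); the class and the pipeline line are illustrative, not taken from the docs:

```python
from river import base

class ClipValues(base.Transformer):
    """Clip every numeric feature to [lo, hi], a trivial custom step."""

    def __init__(self, lo=-1.0, hi=1.0):
        self.lo = lo
        self.hi = hi

    def learn_one(self, x):
        # Stateless step: nothing to learn from the stream.
        return self

    def transform_one(self, x):
        return {k: min(max(v, self.lo), self.hi) for k, v in x.items()}

# Usage in a pipeline (River composes steps with the `|` operator):
# model = ClipValues() | linear_model.LogisticRegression()
```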
High Level Abstractions

Scikit-learn provides many tools related to handling data (e.g. `train_test_split`, cross-validation iterators, ...), validating models, selecting models, and so on. However, I've always felt like we could standardize or better organize common tasks. The PyTorch ecosystem is much more mature in this sense; that's one of the reasons why PyTorch Lightning exists. In particular, I've always liked the `LightningDataModule` and the `Trainer` abstractions, as they speed up the development process without hiding lower-level concepts when someone needs them. In a remote future, it would be nice to have something similar (much simpler) for the River ecosystem.
Data

Here I would just like to mention two of the libraries I like the most for handling data, both with good support for streaming: Hugging Face Datasets and Polars. With some tweaks, I think they are already compatible with River, and it would be nice to have some examples in the documentation.
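As a starting point, one possible Polars-to-River bridge (the file name and its columns are hypothetical):

```python
import polars as pl
from river import linear_model

model = linear_model.LogisticRegression()

# `sensors.csv` and its 'target' column are made up for the example.
df = pl.read_csv("sensors.csv")
for row in df.iter_rows(named=True):  # yields each row as a plain dict
    y = row.pop("target")
    y_pred = model.predict_proba_one(row)  # do something with the prediction
    model.learn_one(row, y)
```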
Rust / Performance / Deployment
Another big advantage of deep learning frameworks is that you can easily export custom models to ONNX. This is possible with default scikit-learn models as well, but if you have a custom model, you have to write your own ONNX converter. This is a big pain point in our field, where we are often asked to deploy on very small machines and where performance is critical.
Having parts of River's backend written in Rust is very interesting, as more and more packages are moving in that direction. However, I am wondering whether that could also help with the final goal of deploying models to environments without a Python interpreter (like the ONNX runtime). That would be super useful and a big advantage over scikit-learn!
This is essentially a list of my desires rather than a real contribution, so I apologize again. I am not expecting anything in particular; I just wanted to share the view of an external user.
In any case, I want to thank you for the cool framework and for your work!