-
Hello 👋
Don't apologize! I'm grateful you took the time to try out the library and provide feedback. It's very useful.

Learn One / Predict One API
This can be the user's perception, but it isn't always true. Some models, like naive Bayes, have this property, but others, such as linear regression, don't. We encourage […]
There shouldn't be any reason to mix both training paradigms. Think of it a bit like scikit-learn: it has some support for mini-batch training, without it being the main paradigm. It's one or the other, but not both. I understand the ideal you're aiming for, but in practice a distinction between pure streaming and mini-batching has to be made.
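To make this concrete, here is a minimal sketch (plain NumPy, not River code, with made-up numbers) of why a single mini-batch gradient step differs from a sequence of single-sample steps for linear regression, whereas counting-based models like naive Bayes are unaffected by how the samples are split:

```python
import numpy as np

# Squared loss for a one-feature linear model: (w * x - y) ** 2
def grad(w, x, y):
    return 2 * (w * x - y) * x

X = np.array([1.0, 2.0])
Y = np.array([1.0, 3.0])
lr = 0.1

# Mini-batch: one step with the gradient averaged over the batch.
w_batch = 0.0
w_batch -= lr * np.mean([grad(w_batch, x, y) for x, y in zip(X, Y)])

# Streaming: one step per sample, each using the latest weights.
w_stream = 0.0
for x, y in zip(X, Y):
    w_stream -= lr * grad(w_stream, x, y)

print(w_batch, w_stream)  # 0.7 vs 1.24: the two paradigms diverge
```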
Custom Models / Preprocessing / Metrics

There is a […]
High Level Abstractions

There's a lot to unpack here. In terms of validation, there is […]. In terms of model selection, there is […]. To add on top of the previous paragraph, it has to be mentioned that online model selection is a tricky topic. Ideally, you want to monitor model performance online, not just find good parameters offline. That's the purpose of the […].

What would you say is missing in terms of tooling? What is giving you pain?
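For concreteness, the online monitoring mentioned above is usually a test-then-train loop (progressive validation). A minimal sketch, with the dataset and model chosen purely for illustration:

```python
from river import datasets, linear_model, metrics

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

# Test-then-train: score each prediction before learning from the label.
for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)
    metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)
```

River's `evaluate.progressive_val_score` wraps this loop in a single call.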
Data

Feel welcome to contribute connectors :). I'm very aware of those libraries, but I haven't used them yet.
Rust / Performance / Deployment

I'm glad you bring this up. We have an ongoing project called light-river, which will implement the "best" models in Rust, with a focus on performance and portability. I hope this helps.
Cheers.
-
Hi!

High Level Abstractions

Admittedly, I was not aware of the […]. For reference, I'll add what I was looking for in terms of tooling, in order of importance (tentative). Note that this list comes from my experience with sklearn rather than River. Hence, some points might not be relevant in the online learning case (apologies for this, but my feeling is that online ML requires a change of mindset that takes some time to settle, and I'm still trying to understand if it suits my use cases).
Learn One / Predict One vs Learn Many / Predict Many
Totally understand this. But, in practice, when running experiments, it's a bit uncomfortable having to differentiate between the two. What I mean is that I would like to have a general method, like:

```python
import pandas as pd

# Possible approach: `learn` and `predict_proba` are the proposed unified
# methods; `model`, `dataset.path` and `names` are assumed to be defined.
for x in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=300_000):
    y = x.pop('target')
    y_pred = model.predict_proba(x)
    model.learn(x, y)
```
```python
# If the model supports learn_many, the learn method above is equivalent
# to calling learn_many directly:
for x in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=300_000):
    y = x.pop('target')
    y_pred = model.predict_proba(x)  # do something with the prediction
    model.learn_many(x, y)
```
```python
# If the model does not support learn_many, the learn method above is
# internally doing something like this:
for batch in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=300_000):
    # Just an example, not necessarily the best approach: split the batch
    # into a list of single-sample dicts.
    for row in batch.to_dict('records'):
        y = row.pop('target')
        y_pred = model.predict_proba(row)
        # Here we could append the prediction to a batch container
        model.learn_one(row, y)
```
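To make the dispatch explicit, here is a hypothetical wrapper (all names are made up) that provides the unified methods on top of any River-style estimator:

```python
import pandas as pd

class AutoBatch:
    """Hypothetical wrapper giving a single learn/predict_proba entry point."""

    def __init__(self, model):
        self.model = model

    def learn(self, X: pd.DataFrame, y: pd.Series):
        if hasattr(self.model, "learn_many"):
            self.model.learn_many(X, y)  # mini-batch path
        else:
            for (_, row), label in zip(X.iterrows(), y):
                self.model.learn_one(row.to_dict(), label)  # row-by-row fallback
        return self

    def predict_proba(self, X: pd.DataFrame):
        if hasattr(self.model, "predict_proba_many"):
            return self.model.predict_proba_many(X)
        return [self.model.predict_proba_one(row.to_dict()) for _, row in X.iterrows()]
```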
The problem then is that the batch size used during training might not be the same as the batch size used in production, which typically equals one. Again, this is just an example, and not necessarily a good one. Another approach might be adding the possibility to define your own loop and, for instance, use it with the model selectors in the […].

Data

I'll try to come up with an example using Polars and Datasets, and share it when ready! TorchData might also be an option.

Rust

[…]
Again, thanks a lot for taking the time to read and reply (I'll try to make shorter posts in the future)!
-
Hi!
I apologize in advance for the lengthy post, but after playing with it a bit, I would like to share a few considerations about River.
Background
I mainly work with sensors in the manufacturing world, where streaming data processing is our daily bread. Currently we develop models mainly based on `sklearn` or custom versions of them (e.g. we follow the `fit`, `predict` API). Honestly, however, this choice has many conceptual (streaming vs batch) and practical (deployment) limitations. When I started searching for something different, I found River. I've been playing with it a little bit in my spare time. I really like it and I think that, in the future, I might convince myself to switch to it. Before doing that, however, I would like to share a few considerations.

Learn One / Predict One API
If I understood well, River was first created for learning and predicting one data point at a time. Batched learning and batched predictions were added later, and only for some estimators. However, I don't fully understand the need to differentiate the two APIs. From the user's perspective, the `learn_one`, `predict_one` API is just a special case of `learn_many` and `predict_many`. Of course, if you are using River it's very likely that in production you will receive one sample at a time. But when you are experimenting/training, you typically have datasets and batches of data. Having to differentiate between the two cases by literally calling different methods is a bit annoying in my opinion. The same applies to `predict_proba_one` and `predict_proba_many`.

Having said that, I am not an online learning expert, so there might also be a theoretical reason for making this distinction (e.g., maybe online learning simply is something different from classical machine learning, not a generalization of it; I just don't know enough). Moreover, in order to homogenize the two APIs, I guess there are probably a lot of subtleties that one must take into account, starting from the input type of the data (dicts vs dataframes/arrays).
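For reference, this is roughly what the split looks like today, assuming one of the estimators (e.g. River's linear models) that implements both APIs; the feature names and values are made up:

```python
import pandas as pd
from river import linear_model

model = linear_model.LogisticRegression()

# One sample at a time: dicts in, a single prediction out.
model.learn_one({"x1": 1.0, "x2": 0.5}, True)
p_one = model.predict_proba_one({"x1": 0.2, "x2": 0.1})

# Mini-batch: pandas structures in, batched predictions out.
X = pd.DataFrame({"x1": [1.0, 0.0], "x2": [0.5, 1.5]})
y = pd.Series([True, False])
model.learn_many(X, y)
p_many = model.predict_proba_many(X)
```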
To summarize, I would say that my dream would be to have a framework that seamlessly works with both batch and online learning. But, of course, that is really hard to do.
Custom Models / Preprocessing / Metrics
It would be nice to have a way for the user to implement custom models and preprocessors. I know that implementing online models is far from trivial, but sometimes the user might simply want to customize an existing one or add a simple step to a pipeline. This is really easy to do in `sklearn`, since you (almost) only need to implement `fit`, `predict` and `transform`, and it is something we do very often. This also applies to preprocessing and metrics (and maybe other classes?). Actually, I don't know if this is already possible, but I haven't found anything in the documentation.
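For what it's worth, here is a sketch of what such a custom step might look like, assuming River's `base.Transformer` contract (`learn_one`/`transform_one`); the class and the pipeline line are illustrative, not taken from the docs:

```python
from river import base

class ClipValues(base.Transformer):
    """Clip every numeric feature to [lo, hi], a trivial custom step."""

    def __init__(self, lo=-1.0, hi=1.0):
        self.lo = lo
        self.hi = hi

    def learn_one(self, x):
        # Stateless step: nothing to learn from the stream.
        return self

    def transform_one(self, x):
        return {k: min(max(v, self.lo), self.hi) for k, v in x.items()}

# Usage in a pipeline (River composes steps with the `|` operator):
# model = ClipValues() | linear_model.LogisticRegression()
```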
High Level Abstractions

Scikit-learn provides many tools related to handling data (e.g. `train_test_split`, cross-validation iterators, ...), validating models, selecting models, and so on. However, I've always felt like we could standardize or better organize common tasks. The PyTorch ecosystem is much more mature in this sense; that's one of the reasons why PyTorch Lightning exists. In particular, I've always liked the `LightningDataModule` and the `Trainer` abstractions, as they speed up the development process without hiding lower-level concepts when someone needs them. In a remote future, it would be nice to have something similar (much simpler) for the River ecosystem.
Data

Here I would just like to mention two of the libraries I like the most for handling data, both with good support for streaming: Hugging Face Datasets and Polars. With some tweaks, I think they are already compatible with River, and it would be nice to have some examples in the documentation.
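As a starting point, one possible Polars-to-River bridge (the file name and its columns are hypothetical):

```python
import polars as pl
from river import linear_model

model = linear_model.LogisticRegression()

# `sensors.csv` and its 'target' column are made up for the example.
df = pl.read_csv("sensors.csv")
for row in df.iter_rows(named=True):  # yields each row as a plain dict
    y = row.pop("target")
    y_pred = model.predict_proba_one(row)  # do something with the prediction
    model.learn_one(row, y)
```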
Rust / Performance / Deployment
Another big advantage of deep learning frameworks is that you can easily export custom models to ONNX. This is possible with default scikit-learn models as well, but if you have a custom model, you have to write your own ONNX converter. This is a big pain point in our field, where we are often asked to deploy on very small machines and where performance is critical.
Having parts of River's backend written in Rust is very interesting, as more and more packages are moving in that direction. However, I am wondering whether that could also help with the final goal of deploying models to environments without a Python interpreter (like the ONNX runtime). That would be super useful and a big advantage over scikit-learn!
This is essentially a list of my desires rather than a real contribution, so I apologize again. I am not expecting anything in particular; I just wanted to share the view of an external user.
In any case, I want to thank you for the cool framework and for your work!