Multi-Label classification with one label per bit of training data + ClassifierChain learning classes on the fly? #487

benwhale · 2021-02-25T15:08:55Z

Hi there,

I'm attempting to use river for some online multi-label classification.

The incoming data comes from a user assigning a class to a tagged object. More than one class can be assigned to the same object, but each assignment arrives separately and can happen at any time.

Per #347, I've tried spinning up the ClassifierChain.

First, it seems that I have to provide all the potential classes in the first training example. In the code below, the first training instance contains 'class3' and 'class1', and the second and third instances add 'class2'. Looking at the output, only class3 and class1 show up.

from river import compose
from river import multioutput
from river import preprocessing
from river import feature_extraction as fx
from river import naive_bayes

docs = [
    ('php phish malicious threatstream-confidence-85 http com threatstream-severity-very-high malicious-activity', 'class1'),
    ('http overflow private public', 'class2'),
    ('x11 linux request private code public shadow', 'class2')
]


model = compose.Pipeline(
    ('tokenize', fx.BagOfWords()),
    ('binarize', preprocessing.Binarizer(threshold=0, dtype=int)),
    ('classifier', multioutput.ClassifierChain(model=naive_bayes.BernoulliNB()))
)

classes = {'class3'}  # Seed with a class that never occurs, to see what happens

for sentence, label in docs:
    classes.add(label)

    # Mark every class seen so far as False, e.g. {'class1': False, 'class3': False}
    training = dict.fromkeys(classes, False)
    training[label] = True  # Only the label attached to this document is True

    # learn_one updates the pipeline in place, so the return value isn't needed
    model.learn_one(sentence, training)

print(training)

model.predict_proba_one('facebook phish malicious x11 threatstream-confidence-85 login com threatstream-severity-very-high malicious-activity')

Output:

{'class1': False, 'class3': False, 'class2': True}

{'class1': {True: 0.999981419024769, False: 1.8580975231396847e-05},
 'class3': {False: 1.0}}

The printed dict shows that all three classes are passed in with the last training example, but predict_proba_one only returns two of them.

Question: Is there a way to get this classifier chain to learn new labels/classes as it encounters them? I was hoping it would learn them as it got the data in!

Secondly, is this a valid approach for a problem where I am only getting one label with each bit of data? I could theoretically get

    ('http overflow private public', 'class1'),
    ('http overflow private public', 'class2'),

coming down the pipeline, and I imagine this would train each classifier both for and against its class on the same features?

If not, would anyone be able to point me in the correct direction as to how to achieve this?

Thank you!

Ben

PS - I also have data coming in as user feedback on classifications, in the form of thumbs up and thumbs down. In a perfect world I'd keep the data pretty sparse and only say that a set of features is true/false for a given class, rather than having to fill in all the possible classes in the training input!

Answered by MaxHalford

MaxHalford (Maintainer) · 2021-02-25T16:48:42Z

Is there a way to get this classifier chain to learn new labels/classes as it encounters them?

Sadly, not for the moment. I guess it could be done!

Secondly, is this a valid approach for a problem where I am only getting one label with each bit of data?

No, you have to group the labels together in the same observation. I suppose you could "hold" the data while you wait for all the labels to arrive.
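
The "hold the data" idea can be sketched in plain Python. This is only an illustration under assumptions that are not part of the thread: the `LabelBuffer` class, the `'doc-1'` key, and the decision of when to flush (a timeout, a user action, and so on) are all hypothetical. It collects single-label events per object and merges them into one target dict once the object is considered complete.

```python
from collections import defaultdict


class LabelBuffer:
    """Hypothetical helper: holds single-label events until an object is flushed."""

    def __init__(self):
        self._features = {}
        self._labels = defaultdict(set)

    def add(self, key, features, label):
        # Remember the features once and accumulate every label seen for this key
        self._features[key] = features
        self._labels[key].add(label)

    def flush(self, key, known_labels):
        # Build one multi-label target: True for collected labels, False otherwise
        features = self._features.pop(key)
        positives = self._labels.pop(key)
        target = {name: name in positives for name in known_labels}
        return features, target


buffer = LabelBuffer()
text = 'http overflow private public'
buffer.add('doc-1', text, 'class1')
buffer.add('doc-1', text, 'class2')

# When to flush is an application decision (assumed here to be "now")
x, y = buffer.flush('doc-1', known_labels=['class1', 'class2', 'class3'])
# x and y could then be passed to model.learn_one(x, y)
```

The trade-off is latency: nothing is learned for an object until it is flushed, which may not suit a setting where labels can arrive at any time.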

benwhale Feb 26, 2021
Author

@MaxHalford Thanks for the quick reply!

On the first question - I was going by the September comment in #347 (comment), which talked about learning new classes without retraining on the whole data set.

On the second question - ah ok! Because a user could add a new label to a bit of data at any time, we won't be able to wait for it to be complete.

I reckon I'll need to spin up a new binary classifier for each label as I discover it, and go from there to get potentially multiple labels (or no labels, if applicable).
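
A minimal sketch of that one-binary-classifier-per-label idea, under stated assumptions. `PerLabelClassifiers` is a hypothetical wrapper, and `WordCountClassifier` is a toy stand-in included only so the example runs without dependencies; neither is a river class. The wrapper only relies on `learn_one(x, y)` and `predict_proba_one(x)`, the interface river classifiers expose, so a factory returning a river pipeline could presumably be passed as `make_model` instead.

```python
from collections import Counter


class WordCountClassifier:
    """Toy stand-in for a binary online classifier (not a river class)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}

    def learn_one(self, x, y):
        self.counts[bool(y)].update(x.split())

    def predict_proba_one(self, x):
        # Add-one smoothed share of token occurrences seen with each outcome
        words = x.split()
        pos = 1 + sum(self.counts[True][w] for w in words)
        neg = 1 + sum(self.counts[False][w] for w in words)
        return {True: pos / (pos + neg), False: neg / (pos + neg)}


class PerLabelClassifiers:
    """Hypothetical wrapper: one binary classifier per label, created on first sight."""

    def __init__(self, make_model):
        self.make_model = make_model
        self.models = {}

    def learn_one(self, x, label, is_positive=True):
        # Sparse update: only the classifier for this label is trained
        if label not in self.models:
            self.models[label] = self.make_model()
        self.models[label].learn_one(x, is_positive)

    def predict_proba_one(self, x):
        return {label: m.predict_proba_one(x) for label, m in self.models.items()}

    def predict_one(self, x, threshold=0.5):
        # May return several labels, or an empty set if none clears the threshold
        probas = self.predict_proba_one(x)
        return {label for label, p in probas.items() if p[True] > threshold}


labeller = PerLabelClassifiers(make_model=WordCountClassifier)
labeller.learn_one('php phish malicious http com', 'class1')
labeller.learn_one('http overflow private public', 'class2')
labeller.learn_one('x11 linux request private code public shadow', 'class2')
# A thumbs-down on a suggested label becomes a negative example for that label only
labeller.learn_one('http overflow private public', 'class1', is_positive=False)

probas = labeller.predict_proba_one('phish malicious login')
predicted = labeller.predict_one('phish malicious login')
```

One caveat with this design: a label's classifier only ever sees positives until some negative feedback arrives for it, so in practice it may be worth feeding each new positive for one label as a negative to the others, depending on how the labels relate.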

MaxHalford Feb 26, 2021
Maintainer

I've added to my todo list to look into the first question this weekend :)

MaxHalford (Maintainer) · 2021-03-06T21:24:10Z

@benwhale I've just merged a fix into the master branch. ClassifierChain and RegressorChain will now correctly work with new and missing outputs.
