Multi-Label classification with one label per bit of training data + ClassifierChain learning classes on the fly? #487
-
Hi there, I'm attempting to use river for some online multi-label classification. The incoming data comes from a user assigning a class to a tagged object. More than one class can be assigned, but each assignment arrives separately and at any time. Per #347 I've tried spinning up the ClassifierChain.
Firstly, it seems to me that I have to provide all the potential classes in the first training example? I've set it up here to initially train with 'class3' and 'class1' on the first training instance, and then the second and third ones are 'class2'. Looking at the output, only class3 and class1 show up.
from river import compose
from river import multioutput
from river import preprocessing
from river import feature_extraction as fx
from river import naive_bayes
docs = [
    ('php phish malicious threatstream-confidence-85 http com threatstream-severity-very-high malicious-activity', 'class1'),
    ('http overflow private public', 'class2'),
    ('x11 linux request private code public shadow', 'class2')
]
binarizer = preprocessing.Binarizer(threshold=0)
bow = fx.BagOfWords()
model = compose.Pipeline(
    ('tokenize', fx.BagOfWords()),
    ('binarize', preprocessing.Binarizer(threshold=0, dtype=int)),
    ('classifier', multioutput.ClassifierChain(model=naive_bayes.BernoulliNB()))
)
classes = {'class3'}  # Adding in one which doesn't exist to see what happens
for sentence, label in docs:
    classes.add(label)
    training = dict.fromkeys(classes, False)  # e.g. {'class3': False, 'class1': False}
    training[label] = True  # Sets the value to True for that particular label
    model.learn_one(sentence, training)  # updates the model in place, no reassignment needed
    print(training)
model.predict_proba_one('facebook phish malicious x11 threatstream-confidence-85 login com threatstream-severity-very-high malicious-activity')
Output:
The output shows that all three classes are being passed in by the last training example, but predict_proba_one only returns the two classes.
Question: Is there a way to get this classifier chain to learn new labels/classes as it encounters them? I was hoping it would learn them as the data came in!
Secondly, is this a valid approach for a problem where I am only getting one label with each bit of data? I could theoretically get
coming down the pipeline, and I imagine this would be training the classifiers against and for the classes with the given features? If not, would anyone be able to point me in the correct direction as to how to achieve this?
Thank you!
Ben
PS: I also have data coming in as user feedback on classifications, in the form of thumbs up and thumbs down, so in a perfect world I'd keep the data pretty sparse and only say that a set of features corresponds to true/false for the given class, rather than having to fill in all the possible classes in the training input!
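For the sparse thumbs up/down feedback described in the PS, one possible direction is to drop the chain and keep one independent binary model per label, created the first time that label is seen. The sketch below is not part of river: `PerLabelClassifiers` is a hypothetical wrapper, and `CountingStub` is a stand-in model that ignores its features so the example runs without any dependencies. It only assumes the wrapped model exposes `learn_one(x, y)` and `predict_proba_one(x)`, which is the interface river classifiers use.

```python
import copy


class PerLabelClassifiers:
    """Hypothetical wrapper: one independent binary model per label."""

    def __init__(self, base_model):
        # base_model is a template; each new label gets its own deep copy.
        self.base_model = base_model
        self.models = {}

    def learn_one(self, x, label, is_positive):
        # Create the model for this label lazily, on first sight.
        if label not in self.models:
            self.models[label] = copy.deepcopy(self.base_model)
        # Only the model for this label is updated; other labels are untouched.
        self.models[label].learn_one(x, is_positive)

    def predict_proba_one(self, x):
        # One {True: p, False: 1 - p} style dict per label seen so far.
        return {
            label: model.predict_proba_one(x)
            for label, model in self.models.items()
        }


class CountingStub:
    """Stand-in model for illustration only; it ignores the features."""

    def __init__(self):
        self.counts = {True: 0, False: 0}

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_proba_one(self, x):
        total = sum(self.counts.values())
        if total == 0:
            return {True: 0.5, False: 0.5}
        return {outcome: n / total for outcome, n in self.counts.items()}


clf = PerLabelClassifiers(CountingStub())
clf.learn_one('http overflow private public', 'class2', True)   # thumbs up
clf.learn_one('http overflow private public', 'class1', False)  # thumbs down
probs = clf.predict_proba_one('http overflow')
```

In practice the stub would be swapped for a real binary classifier or pipeline. The trade-off is that independent models cannot use correlations between labels, which is the thing a classifier chain is designed to capture.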
Replies: 2 comments 2 replies
-
Sadly for the moment no. I guess it could be done!
No, you have to group the labels together in the same observation. I suppose you could "hold" the data while you wait for all the labels to arrive.
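The "hold the data" idea could be sketched roughly as follows. This is a minimal illustration, not river functionality: `LabelBuffer`, the `obj_id` key, and the decision of when to flush are all assumptions. It also assumes the full set of labels is declared up front, since the chain does not pick up new labels after the first observation.

```python
from collections import defaultdict


class LabelBuffer:
    """Hypothetical helper: collect labels per object, then emit one target dict."""

    def __init__(self, known_labels):
        # Fixed label vocabulary, declared before any training happens.
        self.known_labels = sorted(known_labels)
        self._features = {}
        self._labels = defaultdict(set)

    def add(self, obj_id, features, label):
        if label not in self.known_labels:
            raise ValueError(f"unknown label: {label!r}")
        self._features[obj_id] = features
        self._labels[obj_id].add(label)

    def flush(self, obj_id):
        # Remove the object from the buffer and build a complete
        # {label: bool} target covering every known label.
        features = self._features.pop(obj_id)
        assigned = self._labels.pop(obj_id)
        target = {name: name in assigned for name in self.known_labels}
        return features, target


buffer = LabelBuffer({'class1', 'class2', 'class3'})
buffer.add('obj-1', 'php phish malicious', 'class1')
buffer.add('obj-1', 'php phish malicious', 'class3')  # arrives later
x, y = buffer.flush('obj-1')
# The grouped observation could then be passed on in a single call,
# e.g. model.learn_one(x, y) with the pipeline from the question.
```

When to call `flush` is left to the application, for example after a timeout or once the user closes the tagging session.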
-
@benwhale I've just merged a fix into the master branch.