LDA - how to obtain top terms per topic? #830

mdscruggs-onna · 2022-01-26T22:41:48Z


👋 hi all! Love the library and excited to see the merger of creme + multiflow!

https://riverml.xyz/latest/api/preprocessing/LDA/

I've been trying out the online LDA model for topic modeling, and I'm uncertain about the right way to obtain ranked terms per topic (so I can understand which terms are the most relevant). The weights in nu_1 and nu_2 appear relevant, but it's not clear if I can directly use those or how to combine them per-topic/term.

Any help is welcome! Many thanks.

Answered by raphaelsty

raphaelsty · 2022-01-27T16:03:50Z

Maintainer

Hi @mdscruggs-onna 😀,

Unfortunately, LDA does not provide a method that returns the top tokens per topic.

Here's a solution I have in mind that gives interesting results. Rather than using nu_1 and nu_2 directly, I would use the _compute_weights method, which combines nu_1 and nu_2 into a weight for each token and topic, following the LDA algorithm. Once you have the weights, I think it is worth applying the softmax function to them.

import numpy as np
from river import compose, feature_extraction, preprocessing

X = [
    "weather cold",
    "weather hot dry",
    "weather cold rainy",
    "weather hot",
    "weather cold humid",
]

lda = compose.Pipeline(
    feature_extraction.BagOfWords(),
    preprocessing.LDA(
        n_components=2, number_of_documents=25, maximum_size_vocabulary=1000, seed=42
    ),
)

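for _ in range(5):
    for x in X:
        # learn_one updates the pipeline in place, so its return value is not
        # rebound (recent river versions return None from learn_one).
        lda.learn_one(x)
        topics = lda.transform_one(x)
        print(x, topics)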

weather cold {0: 1.5, 1: 1.5}
weather hot dry {0: 0.5, 1: 3.5}
weather cold rainy {0: 0.5, 1: 3.5}
weather hot {0: 2.5, 1: 0.5}
weather cold humid {0: 1.5, 1: 2.4999999999999996}
weather cold {0: 0.5, 1: 2.5}
weather hot dry {0: 3.5, 1: 0.5000000000000004}
weather cold rainy {0: 0.5, 1: 3.5}
weather hot {0: 0.5, 1: 2.5}
weather cold humid {0: 0.5, 1: 3.5}
weather cold {0: 0.5, 1: 2.5}
weather hot dry {0: 3.5, 1: 0.5}
weather cold rainy {0: 1.5, 1: 2.5000000000000004}
weather hot {0: 1.5, 1: 1.5}
weather cold humid {0: 0.5, 1: 3.5000000000000004}
weather cold {0: 0.5, 1: 2.5}
weather hot dry {0: 3.5, 1: 0.5}
weather cold rainy {0: 0.5, 1: 3.5}
weather hot {0: 2.5, 1: 0.5}
weather cold humid {0: 0.5, 1: 3.5}
weather cold {0: 0.5, 1: 2.5}
weather hot dry {0: 3.5, 1: 0.5}
weather cold rainy {0: 0.5, 1: 3.5}
weather hot {0: 1.5, 1: 1.5}
weather cold humid {0: 0.5, 1: 3.5}

Here is the softmax function applied to the token/topic weights. The top tokens seem relevant:

weights, _ = lda["LDA"]._compute_weights(
    nu_1=lda["LDA"].nu_1, nu_2=lda["LDA"].nu_2, n_components=lda["LDA"].n_components
)

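def tokens_components(weights, index_to_word: dict, k: int) -> dict:
    """Return up to k top-scoring tokens for each topic (component)."""
    top = {}
    for component, weight in weights.items():
        # Ratio of this topic's exp-weight to the summed exp-weights of the other
        # topics. It orders tokens the same way a softmax across topics would.
        score = np.exp(weight) / sum(np.exp(w) for c, w in weights.items() if c != component)
        top[component] = [
            # Indices with no word in the vocabulary are skipped after the top-k cut.
            index_to_word[token] for token in np.argsort(-score)[:k] if token in index_to_word
        ]
    return top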

tokens_components(weights=weights, index_to_word=lda["LDA"].index_to_word, k=2)

{0: ['hot', 'dry'], 1: ['cold', 'weather']}
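
For readers who want to follow the scoring step without a trained model, here is a minimal sketch on made-up numbers. Everything in it is an assumption for illustration: `toy_weights` and `toy_index_to_word` are invented values rather than output from river, plain lists stand in for the per-topic weight arrays, and `softmax` and `top_tokens` are hypothetical helpers. It differs from the function above in two ways: it uses a standard softmax across topics for each token, and it filters unknown indices before truncating to k.

```python
import math

# Made-up weights: one list per topic, one entry per token index.
# These numbers are illustrative only; they do not come from river.
toy_weights = {
    0: [0.2, 0.1, 2.0, 1.5],
    1: [0.3, 1.8, 0.1, 0.2],
}
toy_index_to_word = {0: "weather", 1: "cold", 2: "hot", 3: "dry"}


def softmax(values):
    """Standard softmax; subtracting the maximum keeps exp() from overflowing."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_tokens(weights, index_to_word, k):
    """Rank tokens per topic by a softmax taken across topics."""
    topics = sorted(weights)
    n_tokens = len(weights[topics[0]])
    # For each token, turn its per-topic weights into probabilities over topics.
    per_token = [softmax([weights[t][i] for t in topics]) for i in range(n_tokens)]
    top = {}
    for pos, topic in enumerate(topics):
        ranked = sorted(range(n_tokens), key=lambda i: per_token[i][pos], reverse=True)
        # Filter unknown indices first, then keep the k best known tokens.
        words = [index_to_word[i] for i in ranked if i in index_to_word]
        top[topic] = words[:k]
    return top


result = top_tokens(toy_weights, toy_index_to_word, k=2)
print(result)  # {0: ['hot', 'dry'], 1: ['cold', 'weather']}
```

Filtering before truncating means a topic returns k words whenever the vocabulary holds at least k known tokens, which may matter when maximum_size_vocabulary is much larger than the number of tokens seen so far.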

3 replies

mdscruggs-onna Jan 27, 2022
Author

@raphaelsty thanks for the response! I had found the _compute_weights method but hadn't thought of applying softmax. That makes total sense, and the results look reasonable.

This would make a helpful addition to the API. My time is pretty thin at the moment, so hopefully someone can run with that suggestion.

mdscruggs-onna Jan 27, 2022
Author

Related question for you @raphaelsty -- is it appropriate to select the best topic for a document by choosing the topic with the highest value in the output of transform_one()? For example, if transform_one() returns the result below, I would select topic 2 as the most related topic for the input document:

{0: 0.0,
 1: 1.0,
 2: 2.0}

raphaelsty Jan 27, 2022
Maintainer

I think we could select the topic with the highest score. I do not know whether it would be relevant to run a clustering algorithm from river's cluster module on the LDA output. To be honest, I'm not sure which option is best.
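
As a rough sketch of the "highest score" option, the snippet below works on plain dictionaries shaped like the transform_one() output shown in this thread. The helper names `best_topics` and `as_shares` are hypothetical and not part of river. Ties are returned explicitly because the output earlier in the thread contains cases such as {0: 1.5, 1: 1.5}, where no single topic wins.

```python
import math


def best_topics(scores: dict) -> list:
    """Return every topic whose score matches the maximum (several on a tie)."""
    highest = max(scores.values())
    # isclose tolerates float noise such as 2.4999999999999996 versus 2.5.
    return sorted(t for t, value in scores.items() if math.isclose(value, highest))


def as_shares(scores: dict) -> dict:
    """Rescale scores so they sum to 1 (assumes the total is positive)."""
    total = sum(scores.values())
    return {topic: value / total for topic, value in scores.items()}


print(best_topics({0: 0.0, 1: 1.0, 2: 2.0}))  # [2]
print(best_topics({0: 1.5, 1: 1.5}))          # [0, 1]
print(as_shares({0: 0.5, 1: 3.5}))            # {0: 0.125, 1: 0.875}
```

How to break a tie (first topic, random choice, or treating the document as unassigned) is left open here, since the thread does not settle it.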
