LDA - how to obtain top terms per topic? #830
👋 Hi all! Love the library, and excited to see the merger of creme + multiflow!

https://riverml.xyz/latest/api/preprocessing/LDA/

I've been trying out the online LDA model for topic modeling, and I'm unsure of the right way to obtain ranked terms per topic (so I can understand which terms are the most relevant). The weights in

Any help is welcome! Many thanks.
Replies: 1 comment 3 replies
Hi @mdscruggs-onna 😀,

Unfortunately, the LDA does not provide a method that returns the top tokens per topic. Here's a solution I'm thinking of that gives interesting results.

import numpy as np
from river import compose, feature_extraction, preprocessing

# Toy corpus: five short documents about the weather.
X = [
    "weather cold",
    "weather hot dry",
    "weather cold rainy",
    "weather hot",
    "weather cold humid",
]

# Bag-of-words counts feed the online LDA.
lda = compose.Pipeline(
    feature_extraction.BagOfWords(),
    preprocessing.LDA(
        n_components=2, number_of_documents=25, maximum_size_vocabulary=1000, seed=42
    ),
)
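
# Make five passes over the corpus, updating the model one document at a time.
for _ in range(5):
    for x in X:
        # Called for its side effect; lda is not rebound, because learn_one
        # may not return the pipeline in every river version.
        lda.learn_one(x)
        topics = lda.transform_one(x)
        print(x, topics)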

Here is the softmax function applied to the token/topic weights. The top tokens seem relevant:

weights, _ = lda["LDA"]._compute_weights(
    nu_1=lda["LDA"].nu_1, nu_2=lda["LDA"].nu_2, n_components=lda["LDA"].n_components
)

def tokens_components(weights, index_to_word: dict, k: int) -> dict:
    top = {}
    for component, weight in weights.items():
        # The denominator sums over the *other* components only, so this is the
        # odds p / (1 - p) of the softmax probability p; it ranks tokens the same way.
        score = np.exp(weight) / sum([np.exp(w) for c, w in weights.items() if c != component])
        # Keep the k best-scoring token indices that map to a known word.
        top[component] = [
            index_to_word[token] for token in np.argsort(-score)[:k] if token in index_to_word
        ]
    return top

tokens_components(weights=weights, index_to_word=lda["LDA"].index_to_word, k=2)

{0: ['hot', 'dry'], 1: ['cold', 'weather']}

Rather than using `nu_1` and `nu_2` directly, I would use the `_compute_weights` method, which combines `nu_1` and `nu_2` to provide weights per token and topic according to the LDA algorithm. After getting the weights, I think it might be relevant to apply the softmax function.
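
To make the softmax step concrete, here is a minimal, library-free sketch of ranking tokens per topic with a softmax taken across topics. The function name `top_tokens_softmax` and the toy weights are made up for illustration; they are not river output. It assumes the weights come as a dict mapping each topic to a sequence of per-token values (plain lists here, but any indexable sequence of numbers should work) and that `index_to_word` maps integer token indices to words.

```python
import math
from typing import Dict, List, Sequence


def top_tokens_softmax(
    weights: Dict[int, Sequence[float]],
    index_to_word: Dict[int, str],
    k: int,
) -> Dict[int, List[str]]:
    """Rank tokens per topic by a softmax computed across topics (hypothetical helper)."""
    topics = list(weights)
    n_tokens = len(weights[topics[0]])

    # scores[topic][token] = exp(w[topic][token]) / sum over all topics
    scores: Dict[int, List[float]] = {topic: [] for topic in topics}
    for token in range(n_tokens):
        column = [float(weights[topic][token]) for topic in topics]
        shift = max(column)  # subtract the max for numerical stability
        exps = [math.exp(w - shift) for w in column]
        total = sum(exps)
        for topic, e in zip(topics, exps):
            scores[topic].append(e / total)

    # Only token indices with a vocabulary entry are ranked.
    known = [t for t in range(n_tokens) if t in index_to_word]
    top: Dict[int, List[str]] = {}
    for topic in topics:
        ranked = sorted(known, key=lambda t: scores[topic][t], reverse=True)
        top[topic] = [index_to_word[t] for t in ranked[:k]]
    return top


# Toy weights: 2 topics x 4 tokens (made-up numbers for illustration).
toy_weights = {0: [0.1, 2.0, 0.2, 1.5], 1: [0.1, 0.3, 2.5, 0.2]}
toy_vocab = {0: "weather", 1: "hot", 2: "cold", 3: "dry"}

result = top_tokens_softmax(toy_weights, toy_vocab, k=2)
print(result)
```

Note that this sketch drops unknown token indices before taking the top `k`, so it returns `k` words per topic whenever the vocabulary has at least that many entries.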