Hello,
First of all, thank you very much for developing this package!
I was wondering whether it would be possible, in the token filtering step (`step_tokenfilter()`), to implement a method based on the relative rather than absolute frequency of a token's appearance. For example, here's my code:
```r
library(recipes)
library(textrecipes)
library(tibble)

data_train <- tibble(
  sentence = c("This is words", "They are nice !",
               "Pretty pretty pretty good !", "Another sentence that is in doc2"),
  doc = c("doc1", "doc2", "doc2", "doc2")
)

# stopwords_list is a user-defined character vector of stop words
data_rec <- recipe(doc ~ sentence, data = data_train) %>%
  step_tokenize(sentence) %>%
  step_stopwords(sentence, custom_stopword_source = stopwords_list) %>%
  step_tokenfilter(sentence, max_tokens = 1000) %>%
  step_tfidf(sentence)
```
Indeed, the absolute frequency-based filter is heavily influenced by the length of the document in which the sentences appear: the longer the text, the more often any given token is likely to occur. It would be useful to be able to filter tokens based on relative frequency (i.e., the token's count divided by the total number of words in each value of the `doc` variable in the example) before calculating the tf-idf on the retained tokens, wouldn't it?
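To make the idea concrete, here is a rough sketch of the kind of relative-frequency computation I have in mind. It uses tidytext/dplyr outside the recipe (not textrecipes itself), and the 0.1 threshold is purely hypothetical:

```r
library(dplyr)
library(tibble)
library(tidytext)

data_train <- tibble(
  sentence = c("This is words", "They are nice !",
               "Pretty pretty pretty good !", "Another sentence that is in doc2"),
  doc = c("doc1", "doc2", "doc2", "doc2")
)

# Tokenize, then compute each token's frequency relative to the
# total number of words in its document
relative_freq <- data_train %>%
  unnest_tokens(word, sentence) %>%
  count(doc, word, name = "n") %>%
  group_by(doc) %>%
  mutate(rel_freq = n / sum(n)) %>%
  ungroup()

# Keep only tokens whose relative frequency in some document
# reaches a (hypothetical) threshold; these would then be the
# vocabulary passed on to the tf-idf step
kept_tokens <- relative_freq %>%
  filter(rel_freq >= 0.1) %>%
  distinct(word)
```

This normalizes by document length, so a token that appears often only because its document is long no longer dominates the filter.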
Thx in advance ;)