- Documentation for `tidy()` methods for all steps has been improved to describe the return value more accurately. (#262)
- Calling `?tidy.step_*()` now sends you to the documentation for `step_*()` where the outcome is documented. (#261)
- `step_textfeatures()` has been made faster and more robust. (#265)
- Fixed bug in `step_clean_levels()` where it would produce NAs for character columns. (#274)
- textfeatures has been removed from Suggests. (#255)
- `step_textfeatures()` no longer returns a politeness feature. (#254)
- `step_untokenize()` and `step_normalization()` now return factors instead of strings. (#247)
- `step_clean_names()` now throws an informative error if needed non-standard role columns are missing during `bake()`. (#235)
- The `keep_original_cols` argument has been added to `step_tokenmerge()`; see the sketch after this list. This change should mean that every step that produces new columns has the `keep_original_cols` argument. (#242)
- Many internal changes to improve consistency and slight speed increases.
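A minimal sketch of the new `keep_original_cols` argument on `step_tokenmerge()`; the data and the column names `text_a` and `text_b` are made up for illustration:

```r
library(recipes)
library(textrecipes)

toy <- tibble::tibble(
  text_a = c("first bit of text", "second bit"),
  text_b = c("more words here", "and here")
)

rec <- recipe(~ ., data = toy) |>
  step_tokenize(text_a, text_b) |>
  # keep_original_cols = TRUE creates the merged `tokenmerge` column while
  # retaining text_a and text_b instead of removing them
  step_tokenmerge(text_a, text_b, keep_original_cols = TRUE)
```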
- Fixed bug where `step_dummy_hash()` and `step_texthash()` would add new columns before old columns. (#235)
- Fixed bug where `vocabulary_size` wasn't tunable in `step_tokenize_bpe()`. (#239)
- Steps with tunable arguments now have those arguments listed in the documentation.
- All steps that add new columns will now informatively error if a name collision occurs.
- Fixed bug where `step_tf()` wasn't tunable for the `weight` argument.
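A quick sketch of tuning the two arguments mentioned above; the toy data is made up, and `tunable()` is the generic re-exported by the tune package:

```r
library(tune)
library(recipes)
library(textrecipes)

text_df <- tibble::tibble(text = c("some example text", "a little more text"))

rec <- recipe(~ text, data = text_df) |>
  step_tokenize_bpe(text, vocabulary_size = tune()) |>
  step_tf(text, weight = tune())

# Both marked arguments should now appear as tunable parameters
tunable(rec)
```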
- Setting `token = "tweets"` in `step_tokenize()` has been deprecated due to `tokenizers::tokenize_tweets()` being deprecated. (#209)
- `step_sequence_onehot()`, `step_dummy_hash()`, and `step_dummy_texthash()` now return integers. `step_tf()` returns integers when `weight_scheme` is `"binary"` or `"raw count"`.
- All steps now have `required_pkgs()` methods.
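A small illustration of `required_pkgs()` on a recipe; the engine choice here is arbitrary and the data is made up:

```r
library(recipes)
library(textrecipes)

rec <- recipe(~ text, data = tibble::tibble(text = "abc")) |>
  step_tokenize(text, engine = "spacyr")

# Lists the packages the recipe needs at prep()/bake() time,
# which is useful e.g. when sending work to parallel workers
required_pkgs(rec)
```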
- Examples no longer include `if (require(...))` code.
- Indicate which steps support case weights (none), to align documentation with other packages.
- Remove use of okc_text in vignette.
- Fix bug in printing of tokenlists.
- `step_tfidf()` now correctly saves the idf values and applies them to the testing data set.
- `tidy.step_tfidf()` now returns calculated IDF weights.
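A minimal sketch of inspecting the learned IDF weights; the data and the `id` value are made up, and the exact columns of the returned tibble may vary by version:

```r
library(recipes)
library(textrecipes)

text_df <- tibble::tibble(text = c("some words", "some more words", "words again"))

rec <- recipe(~ text, data = text_df) |>
  step_tokenize(text) |>
  step_tfidf(text, id = "tfidf") |>
  prep()

# After prep(), tidy() returns the IDF weight estimated for each token
tidy(rec, id = "tfidf")
```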
- `step_dummy_hash()` generates binary indicators (possibly signed) from simple factor or character vectors.
- `step_tokenize()` has gotten a couple of cousin functions, `step_tokenize_bpe()`, `step_tokenize_sentencepiece()`, and `step_tokenize_wordpiece()`, which wrap {tokenizers.bpe}, {sentencepiece}, and {wordpiece} respectively (#147).
- Added `all_tokenized()` and `all_tokenized_predictors()` to more easily select tokenized columns (#132).
- Use `show_tokens()` to more easily debug a recipe involving tokenization; see the sketch after this list.
- Reorganized documentation for all recipe step `tidy` methods (#126).
- Steps now have a dedicated subsection detailing what happens when `tidy()` is applied. (#163)
- All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect (#141).
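A sketch of the new selectors and `show_tokens()`; the data is made up for illustration:

```r
library(recipes)
library(textrecipes)

text_df <- tibble::tibble(text = c("a few words", "some more words here"))

# all_tokenized_predictors() selects every tokenized column without
# having to name each one
rec <- recipe(~ text, data = text_df) |>
  step_tokenize(text) |>
  step_tokenfilter(all_tokenized_predictors(), max_tokens = 10) |>
  step_tf(all_tokenized_predictors())

# show_tokens() prints the tokens a column holds at that point in the recipe
recipe(~ text, data = text_df) |>
  step_tokenize(text) |>
  show_tokens(text)
```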
- `step_ngram()` has been given a speed increase to put it in line with other packages' performance.
- `step_tokenize()` will now try to error if the vocabulary size is too low when using `engine = "tokenizers.bpe"` (#119).
- Warning given by `step_tokenfilter()` when filtering failed to apply now correctly refers to the right argument name (#137).
- `step_tf()` now returns 0 instead of NaN when there aren't any tokens present (#118).
- `step_tokenfilter()` now has a new argument, `filter_fun`, which takes a function that can be used to filter tokens; see the sketch after this list. (#164)
- `tidy.step_stem()` now correctly shows if a custom stemmer was used.
- Added `keep_original_cols` argument to `step_lda()`, `step_texthash()`, `step_tf()`, `step_tfidf()`, `step_word_embeddings()`, `step_dummy_hash()`, `step_sequence_onehot()`, and `step_textfeatures()` (#139).
- Steps with a `prefix` argument now create names according to the pattern `prefix_variablename_name/number`. (#124)
- Fixed a bug in `step_tokenfilter()` and `step_sequence_onehot()` that sometimes caused crashes in R 4.1.0.
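A minimal sketch of `filter_fun`, with made-up data: the function receives a character vector of tokens and must return a logical vector of the same length.

```r
library(recipes)
library(textrecipes)

text_df <- tibble::tibble(text = "the quick brown fox jumps over the lazy dog")

rec <- recipe(~ text, data = text_df) |>
  step_tokenize(text) |>
  # keep only tokens longer than three characters
  step_tokenfilter(text, filter_fun = function(x) nchar(x) > 3)
```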
- `step_lda()` now takes a tokenlist instead of a character variable. See the readme for more detail.
- `step_sequence_onehot()` now takes tokenlists as input.
- Added {tokenizers.bpe} engine to `step_tokenize()`.
- Added {udpipe} engine to `step_tokenize()`.
- Added new steps for cleaning variable names or levels with {janitor}: `step_clean_names()` and `step_clean_levels()`. (#101)
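A sketch of the two {janitor}-backed steps; it assumes {janitor} is installed, and the messy names and levels are invented for illustration:

```r
library(recipes)
library(textrecipes)

messy_df <- tibble::tibble(
  `Messy Name` = factor(c("High Value", "High Value", "low-value"))
)

rec <- recipe(~ ., data = messy_df) |>
  step_clean_names(all_predictors()) |>          # "Messy Name" -> "messy_name"
  step_clean_levels(all_nominal_predictors()) |> # "High Value" -> "high_value"
  prep()

bake(rec, new_data = NULL)
```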
- The stopwords package has been moved from Imports to Suggests.
- `step_ngram()` gained an argument `min_num_tokens` to be able to return multiple n-grams together; see the sketch after this list. (#90)
- Added `step_text_normalization()` to perform Unicode normalization on character vectors. (#86)
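A minimal sketch of `min_num_tokens`, with made-up data:

```r
library(recipes)
library(textrecipes)

text_df <- tibble::tibble(text = "not all who wander are lost")

rec <- recipe(~ text, data = text_df) |>
  step_tokenize(text) |>
  # num_tokens = 3 with min_num_tokens = 1 returns unigrams, bigrams,
  # and trigrams together instead of only trigrams
  step_ngram(text, num_tokens = 3, min_num_tokens = 1)
```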
- `step_word_embeddings()` got an argument `aggregation_default` to specify the value used in cases where no words match the embedding.
- `step_tokenize()` got an `engine` argument to specify packages other than tokenizers to tokenize with; see the sketch after this list.
- `spacyr` has been added as an engine to `step_tokenize()`.
- `step_lemma()` has been added to extract the lemma attribute from tokenlists.
- `step_pos_filter()` has been added to allow filtering of tokens based on their part-of-speech tags.
- `step_ngram()` has been added to generate ngrams from tokenlists.
- `step_stem()` now correctly uses the options argument. (Thanks to @grayskripko for finding the bug, #64)
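A sketch of the `engine` argument paired with `step_lemma()`; the `"spacyr"` engine assumes a working spaCy installation, and the data is made up:

```r
library(recipes)
library(textrecipes)

text_df <- tibble::tibble(text = "some example text")

rec <- recipe(~ text, data = text_df) |>
  # engine selects the tokenization backend; spacyr also produces the
  # lemma attribute that step_lemma() extracts
  step_tokenize(text, engine = "spacyr") |>
  step_lemma(text)
```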
- `step_word2vec()` has been changed to `step_lda()` to reflect what is actually happening.
- `step_word_embeddings()` has been added. It allows for the use of pre-trained word embeddings to convert token columns to vectors in a high-dimensional "meaning" space; see the sketch after this list. (@jonthegeek, #20)
- text2vec has been changed from Imports to Suggests.
- textfeatures has been changed from Imports to Suggests.
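A sketch of `step_word_embeddings()` with a tiny made-up embedding table; real use would load pre-trained embeddings, and the column name `tokens` follows the convention of tables like those returned by textdata's embedding functions:

```r
library(recipes)
library(textrecipes)

# first column holds the tokens, remaining columns are numeric dimensions
embeddings <- tibble::tibble(
  tokens = c("cat", "dog", "fish"),
  d1 = c(0.1, 0.2, 0.3),
  d2 = c(0.4, 0.5, 0.6)
)

rec <- recipe(~ text, data = tibble::tibble(text = "cat and dog")) |>
  step_tokenize(text) |>
  step_word_embeddings(text, embeddings = embeddings)
```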
- `step_tfidf()` calculations are slightly changed due to a flaw in the original implementation (dselivanov/text2vec#280).
- A custom stemming function can now be used in `step_stem()` using the `custom_stemmer` argument; see the sketch below.
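A minimal sketch of `custom_stemmer`: any function from a character vector to a character vector of the same length works; `toupper()` is only a stand-in to make the effect visible.

```r
library(recipes)
library(textrecipes)

text_df <- tibble::tibble(text = "running runner runs")

rec <- recipe(~ text, data = text_df) |>
  step_tokenize(text) |>
  step_stem(text, custom_stemmer = toupper)
```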
- `step_textfeatures()` has been added; it allows for multiple numerical features to be pulled from text.
- `step_sequence_onehot()` has been added; it allows for one-hot encoding of sequences of fixed width. See the sketch after this list.
- `step_word2vec()` has been added; it calculates word2vec dimensions.
- `step_tokenmerge()` has been added; it combines multiple list columns into one list column.
- `step_texthash()` now correctly accepts the `signed` argument.
- Documentation has been improved to showcase the importance of filtering tokens before applying `step_tf()` and `step_tfidf()`.
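A sketch of `step_sequence_onehot()` as it works in current versions of the package (after the later change that made it take tokenlists); the data is made up:

```r
library(recipes)
library(textrecipes)

text_df <- tibble::tibble(text = c("a short text", "a slightly longer text here"))

rec <- recipe(~ text, data = text_df) |>
  step_tokenize(text) |>
  # each document is padded/truncated to sequence_length positions, and each
  # position becomes an integer column indexing into the learned vocabulary
  step_sequence_onehot(text, sequence_length = 5)
```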
- First CRAN version