I am investigating performance problems in `load.corpus`. I think that performance could be improved significantly by replacing `scan` with another approach to loading files.
This flame graph from profiling `load.corpus` shows that `scan` accounts for most of the run time.
I ran a benchmark comparing the call to `scan` in `load.corpus` with two other functions for reading text files, `readChar` and `readLines`.
`readChar` reads the same file in ~10% of the time. However, while the current approach returns each text split on '\n', `readChar` returns each file's contents as a single string.
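A comparison along these lines can be sketched as follows. This is a minimal sketch, not the benchmark that was actually run: the test file is generated on the fly, the `read_*` wrapper names are hypothetical, and the arguments passed to `scan` are an assumption about how `load.corpus` calls it.

```r
# Hypothetical test file; the real corpus files are not part of this issue.
path <- tempfile(fileext = ".txt")
writeLines(rep("the quick brown fox jumps over the lazy dog", 1000), path)

# Assumed form of the current call: one element per line.
read_scan <- function(f) scan(f, what = "char", sep = "\n", quiet = TRUE)

# readLines also returns one element per line.
read_lines <- function(f) readLines(f, warn = FALSE)

# readChar returns the whole file as a single string.
read_char <- function(f) {
  readChar(f, nchars = file.info(f)$size, useBytes = TRUE)
}

by_scan  <- read_scan(path)
by_lines <- read_lines(path)
by_char  <- read_char(path)

# Shape of each result: a vector of lines vs. one long string.
length(by_scan)
length(by_lines)
length(by_char)

# Rough timings; a dedicated benchmarking package would give tighter numbers.
system.time(for (i in 1:50) read_scan(path))
system.time(for (i in 1:50) read_lines(path))
system.time(for (i in 1:50) read_char(path))
```

The timings will vary by machine and file size, so they are only useful for comparing the three readers against each other on the same input.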
Do the downstream text processing functions (`make.samples`, `txt.to.words.ext`, `delete.markup`, etc.) require each text to be split into lines? If they do, maybe we could modify `txt.to.words.ext` or another downstream function to handle that step in one of the tokenization loops that already occur.
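If line-split input does turn out to be required, the split could be recovered from the single string that `readChar` returns. The helper below is a hypothetical sketch (`split_lines` is not an existing function in the package). It drops empty lines on the assumption that the result should match `scan`, which skips blank lines by default.

```r
# Hypothetical helper: turn one long string into a vector of lines.
split_lines <- function(txt) {
  # Split on LF, tolerating CRLF line endings.
  lines <- strsplit(txt, "\r?\n")[[1]]
  # Drop empty lines, mirroring scan's default blank.lines.skip = TRUE.
  lines[nzchar(lines)]
}

txt <- "first line\nsecond line\n\nfourth line\n"
split_lines(txt)
```

Whether dropping blank lines matters to the downstream functions is an open question and would need checking against their actual behavior.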
Replaced `scan` with `readChar` in a8bf057 on /experimental. `load.corpus` now runs almost instantly on my 1000-item corpus, and `load.corpus.and.parse` completes without errors. This still warrants more investigation to make sure the approach is compatible with the downstream functions.