Performance in load.corpus #36

adunmore · 2020-02-26T20:59:16Z

I am investigating performance problems in load.corpus. I think that performance could be improved significantly by replacing scan with another approach to loading files.

A flame graph from profiling load.corpus shows that scan accounts for most of the run time.

I ran a benchmark comparing the call to scan in load.corpus with two other functions for reading text files, readChar and readLines.

readChar runs on the same file in ~10% of the time. However, while the current approach returns each text split on '\n', this function returns each file's contents as a single string.
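A comparison along these lines can be sketched as follows. This is a minimal illustration, not the benchmark that was actually run: the sample file, the helper names (`read_scan`, `read_lines`, `read_char`), and the `scan` arguments are assumptions, and the real call in load.corpus may differ.

```r
# Write a small temporary text file to read back (hypothetical sample data).
path <- tempfile(fileext = ".txt")
writeLines(rep("The quick brown fox jumps over the lazy dog.", 5000), path)

# scan(): returns one element per line when sep = "\n"
# (arguments assumed; the call in load.corpus may differ).
read_scan <- function(f) {
  scan(f, what = "character", sep = "\n", quiet = TRUE, encoding = "UTF-8")
}

# readLines(): also returns one element per line.
read_lines <- function(f) {
  readLines(f, encoding = "UTF-8", warn = FALSE)
}

# readChar(): returns the whole file as a single string.
read_char <- function(f) {
  readChar(f, nchars = file.info(f)$size, useBytes = TRUE)
}

# Time repeated reads of the same file with each function.
timings <- sapply(
  list(scan = read_scan, readLines = read_lines, readChar = read_char),
  function(reader) system.time(for (i in 1:20) reader(path))[["elapsed"]]
)
print(timings)

# The shape of the result differs: a vector of lines vs. a single string.
print(c(scan = length(read_scan(path)),
        readLines = length(read_lines(path)),
        readChar = length(read_char(path))))
```

Timings will vary with file size and platform, so the ratio seen here need not match the ~10% figure above.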

Do the downstream text processing functions (make.samples, txt.to.words.ext, delete.markup, etc.) require each text to be split into lines? If they do, maybe we could modify txt.to.words.ext or another downstream function to handle that step in one of the tokenization loops that already occur.
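If a downstream function did turn out to need line-split input, the split could be recovered from the single string with base R. The helper name `split_into_lines` below is hypothetical and not part of the package; this is only a sketch of the idea.

```r
# Hypothetical helper: split a whole-file string into lines,
# accepting both Unix (\n) and Windows (\r\n) line endings.
split_into_lines <- function(text) {
  strsplit(text, "\r?\n")[[1]]
}

whole_text <- "first line\nsecond line\r\nthird line\n"
lines <- split_into_lines(whole_text)
print(lines)
```

Unlike scan with its default settings, this keeps blank lines in the middle of the text as empty strings, so the two results are not guaranteed to be identical.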


adunmore · 2020-02-26T21:10:51Z

Replaced scan with readChar in a8bf057 on /experimental. load.corpus now runs almost instantly on my 1,000-item corpus, and load.corpus.and.parse completes without errors. This still warrants more investigation to make sure the approach is compatible with the downstream functions.

adunmore · 2020-03-04T16:52:50Z

Both delete.markup and txt.to.words can accept individual texts as whole strings, so the existing code is compatible with my approach in a8bf057.

I think this code is ready to be merged with the main branch.

adunmore referenced this issue Feb 26, 2020

replaced scan() with readChar() in load.corpus.R

a8bf057

adunmore mentioned this issue Mar 4, 2020

Experimental #37

Merged
