Refactor of the data extractors used during indexing #656
This is mostly still an experimental, unpolished PR. It refactors the data extractors used during indexing.
The main ideas are to:

- share a single pool of memory between all the extractors
- stop using `rayon` and use `std::thread::scope` and `std::thread::spawn` instead

The advantages of this approach should be:

- the `Sorter`s used by the extractors will be able to process larger batches of documents before flushing their content to disk
- we no longer depend on `rayon`, which removes a large dependency and makes the indexing code much easier to profile using tools such as flamegraphs and Instruments.app

However, on machines with a very small amount of RAM, indexing documents which have many words in them could result in more file system pressure. This is because the extractors now have to share the same pool of memory, and so the `word_pair_proximity` extractor in particular could use up its allocated memory faster, which would make it flush its content to disk more often.
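As a rough illustration of the threading model (not the actual code in this PR), here is a minimal sketch of how extractors could run on scoped threads instead of through `rayon`. The `DocumentChunk` type and the two extractor functions are placeholders, not the real milli APIs:

```rust
use std::thread;

// Placeholder types and extractor functions, not the real milli APIs.
struct DocumentChunk;

fn extract_word_docids(chunks: &[DocumentChunk]) { /* fill a grenad::Sorter, flush when full */ }
fn extract_word_pair_proximity(chunks: &[DocumentChunk]) { /* idem */ }

fn extract_all(chunks: &[DocumentChunk]) {
    // `thread::scope` lets the spawned threads borrow `chunks` directly
    // and joins them all before returning, so no rayon thread pool is needed.
    thread::scope(|s| {
        s.spawn(|| extract_word_docids(chunks));
        s.spawn(|| extract_word_pair_proximity(chunks));
    });
}
```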
Remaining work to be done
For now, the number of threads is hard-coded to 8 and the available memory is hard-coded to 8GB. This should of course be changed to use the actual number of threads and memory available during indexing.
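For the thread count, the standard library already exposes what is needed to replace the hard-coded value; a minimal sketch follows (detecting the available memory is left out since it requires an external crate or platform-specific code):

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Number of extractor threads to use: the machine's available parallelism,
/// falling back to 1 if it cannot be determined (instead of a hard-coded 8).
fn extractor_thread_count() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}
```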
The amount of memory that each extractor is allowed to use is also hard-coded. For example, we give half of the total memory to the `word_pair_proximity` extractor. But ideally, this number should be determined based on the actual amount that each extractor needs to process the specific dataset. For example, the `songs` and `geo` datasets do not have a lot of word pairs, so it is counterproductive to allocate half of the memory to the `word_pair_proximity` extractor for them.

Proper error handling should also be reintroduced.
Benchmark results
The benchmarks so far are satisfactory. We gain up to ~20% in the `movies` benchmarks, probably due to the increased parallelism. In the `wiki` benchmark, we gain ~10%, probably because we read/write/merge fewer `grenad::Reader`s. The `geo` and `songs` datasets perform slightly worse (-3% to -10%), which I think could be due to the unfair memory allocation between extractors.

However, for this PR, it won't be enough to rely on these benchmarks, since the performance will change a lot based on the OS, the number of threads, and the memory that the machine has. For example, on my computer (MacBook Pro M1, 16GB RAM), indexing the `wiki` dataset is 4 times faster. There are probably some configurations which cause greater performance losses as well.

I also expect the performance benefits to grow after #639 is merged.
Proposed solution to allocate memory fairly between extractors
I'd like to introduce a new data structure, called `SorterPool`, which can manage `N` `grenad::Sorter`s and allocate memory fairly between them.

Initially, each sorter receives `MEM / N` memory. When any of the sorters in the pool exceeds its allocated memory, it flushes its content in a chunk. At that point, we need to tell the other sorters to also flush their content. We record how much memory each sorter used. Then, we reallocate a different amount of memory to each sorter based on its previous memory usage.

Example
We have 4 sorters, S1..S4, and access to 4GB of memory. Initially, each sorter has access to 1GB of memory.
We start inserting elements in each sorter. After a while, we have this situation:
After S1 flushes the content it has in memory, we also ask S2..S4 to do the same. Then we reallocate the 4GB of memory as follows:
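As a rough illustration of the reallocation rule described above, here is a hedged Rust sketch with made-up usage numbers for S1..S4; the `reallocate` function is hypothetical, not an existing grenad or milli API:

```rust
/// Hypothetical sketch: split a total memory budget between sorters
/// proportionally to how much each of them used before the last flush.
fn reallocate(total: u64, used: &[u64]) -> Vec<u64> {
    let used_total: u64 = used.iter().sum();
    used.iter()
        .map(|&u| {
            if used_total == 0 {
                // Nothing was used yet: fall back to an even split.
                total / used.len() as u64
            } else {
                // Each sorter gets a share proportional to its past usage.
                total * u / used_total
            }
        })
        .collect()
}

fn main() {
    const GIB: u64 = 1 << 30;
    // Made-up usage numbers at the moment S1 exceeds its 1GB budget:
    // S1 used 1GB, S2 512MB, S3 256MB, S4 256MB.
    let used = [GIB, GIB / 2, GIB / 4, GIB / 4];
    let budgets = reallocate(4 * GIB, &used);
    // S1 now gets 2GB, S2 1GB, S3 and S4 512MB each.
    println!("{budgets:?}");
}
```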