Algorithm Details
Adding a snippet:
- Receive snippet data
- Increment and get the value of `~~counter`
- Add the snippet data to Redis
- Index and score the snippet
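A minimal sketch of that add path, assuming a Node.js service with an ioredis-style client (Redis commands exposed as lowercase methods); `indexSnippet` and `scoreSnippet` are the routines sketched under Indexing and Scoring below, and storing the JSON at the key equal to the new ID matches the `GET [snippet index]` lookup used by search:

```js
const Redis = require("ioredis");
const redis = new Redis();

// Add a snippet: get a fresh ID from ~~counter, store the data, then index and score it.
async function addSnippet(redis, snippet) {
  const id = await redis.incr("~~counter");              // increment and read the global counter
  await redis.set(String(id), JSON.stringify(snippet));  // snippet data lives at the key equal to its ID
  await indexSnippet(redis, id, snippet);                // add the ID to each token's index set
  await scoreSnippet(redis, snippet);                    // rebuild the weighted score sets for its tokens
  return id;
}
```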
Deleting a snippet:
- Unindex the snippet, then re-score it so the weighted score sets reflect the removal
- Add the snippet to the `~~recently-deleted` sorted set, with the score being the UTC timestamp
- Clean the `~~recently-deleted` set by removing any members older than 3 days
  - Note that this leaves the underlying Redis string that holds the snippet data
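A hedged sketch of the delete path under the same assumptions (the 3-day window is expressed in milliseconds to match the UTC-timestamp scores; `unindexSnippet` and `scoreSnippet` are sketched below):

```js
const THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000;

// Delete a snippet: unindex it, re-score its tokens, then park the ID in ~~recently-deleted.
async function deleteSnippet(redis, id) {
  const snippet = JSON.parse(await redis.get(String(id))); // the data string stays in Redis after deletion
  await unindexSnippet(redis, id, snippet);
  await scoreSnippet(redis, snippet);                      // score sets now reflect the removal
  await redis.zadd("~~recently-deleted", Date.now(), id);  // score = current UTC timestamp (ms)
  await cleanRecentlyDeleted(redis);
}

// Drop members older than 3 days; the underlying snippet strings are left alone.
async function cleanRecentlyDeleted(redis) {
  await redis.zremrangebyscore("~~recently-deleted", "-inf", Date.now() - THREE_DAYS_MS);
}
```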
Getting recently deleted snippets:
- Do a ZREVRANGEBYSCORE on the `~~recently-deleted` sorted set, from +inf down to 3 days ago
  - This returns a list of snippet IDs, so go get the snippets after that
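That listing step might look like the following (same client assumptions, reusing `THREE_DAYS_MS` and `cleanRecentlyDeleted` from the delete sketch); newest deletions come back first because ZREVRANGEBYSCORE walks from the max score down:

```js
// List snippets deleted within the last 3 days, then fetch each one's JSON.
async function getRecentlyDeleted(redis) {
  const cutoff = Date.now() - THREE_DAYS_MS;
  const ids = await redis.zrevrangebyscore("~~recently-deleted", "+inf", cutoff);
  const raw = await Promise.all(ids.map((id) => redis.get(id)));  // GET [snippet index]
  return raw.filter(Boolean).map((json) => JSON.parse(json));
}
```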
Undeleting (restoring) a snippet:
- Clean the `~~recently-deleted` set
- Remove the snippet from `~~recently-deleted`
- Index and score the snippet
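Restoring is then the mirror image of deletion, again assuming the helpers sketched elsewhere in this document:

```js
// Undelete: pull the ID out of ~~recently-deleted and put the snippet back into the indexes.
async function undeleteSnippet(redis, id) {
  await cleanRecentlyDeleted(redis);
  await redis.zrem("~~recently-deleted", id);
  const snippet = JSON.parse(await redis.get(String(id)));  // the data string was never deleted
  await indexSnippet(redis, id, snippet);
  await scoreSnippet(redis, snippet);
}
```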
Permanently deleting a snippet:
- Clean the `~~recently-deleted` set
- Remove the snippet from `~~recently-deleted`
- Delete the snippet data
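Permanent deletion is the only step that actually removes the data string; a sketch under the same assumptions:

```js
// Permanently delete: drop the ID from ~~recently-deleted, then delete the snippet data itself.
async function purgeSnippet(redis, id) {
  await cleanRecentlyDeleted(redis);
  await redis.zrem("~~recently-deleted", id);
  await redis.del(String(id));   // unlike a normal delete, the JSON string is gone for good
}
```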
Editing a snippet:
- Delete the snippet
- Re-add the snippet
This could clearly be optimized, but this is how it works for now.
For search, the goal is for no query to ever take more than 100ms (except for stress tests).
When a query is received, it is tokenized using the process described below. Each resulting token should already have a pre-calculated sorted set of snippets and scores, so we just combine all of those entries. If the search query has two tokens, the command looks like this: `ZUNIONSTORE ~~results 2 [first token]-scores [second token]-scores`.
- Time complexity: O(S) + O(R log(R)), with S being the sum of the sizes of the input token score sorted sets and R being the number of elements in the resulting `~~results` sorted set.
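In an ioredis-style client that union step might look like the following; the number of keys and the `[token]-scores` naming come straight from the command above:

```js
// Step 1 of a search: merge the per-token score sets into a temporary ~~results sorted set.
async function unionTokenScores(redis, tokens) {
  const keys = tokens.map((t) => `${t}-scores`);
  // ZUNIONSTORE ~~results <numkeys> token1-scores token2-scores ...
  await redis.zunionstore("~~results", keys.length, ...keys);
}
```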
Get the resulting snippet indices with this command: `ZREVRANGEBYSCORE ~~results +inf 1 WITHSCORES LIMIT 0 25`.
- Time complexity: O(log(R)), with R being the number of elements in the `~~results` sorted set.
Finally, go through each snippet and get the stringified JSON data with a simple `GET [snippet index]`.
- Time complexity: O(1), because the LIMIT caps the number of snippets to GET at 25.
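Steps 2 and 3 together, under the same client assumptions (WITHSCORES makes Redis return a flat member/score list, hence the stride of 2):

```js
// Steps 2 and 3 of a search: take the top 25 matches from ~~results, then GET each snippet's JSON.
async function topResults(redis) {
  const flat = await redis.zrevrangebyscore(
    "~~results", "+inf", 1, "WITHSCORES", "LIMIT", 0, 25
  );
  const hits = [];
  for (let i = 0; i < flat.length; i += 2) {      // WITHSCORES returns [member, score, member, score, ...]
    const id = flat[i];
    const score = Number(flat[i + 1]);
    const json = await redis.get(id);             // GET [snippet index]
    if (json) hits.push({ id, score, snippet: JSON.parse(json) });
  }
  return hits;
}
```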
So, the total complexity of the algorithm is O(S) + O(R log(R)) + O(log(R)) + O(1) = O(S) + O((R+1) log(R)) = O(S) + O(R log(R)), which is exactly the same complexity as the first step. So, that first step of the search algorithm is the bottleneck.
Is this a restrictive bottleneck? It definitely could be, yes. If the user searches for two terms that are each used 1000s of times but never together, then R = S, so the complexity is O(S log(S)). This is not a horrible result, but on enterprise-scale systems where some terms can be used 1000000s of times, it would be unworkable. Remember that CheatSheet indexes the solution for terms, so if a very common word like "of" is not removed from the indexing process and it gets searched for, it could definitely be an issue.
Some sort of caching could potentially mitigate this. For every search query, create a new Redis key that contains the results and that expires after 48 hours, or upon a user refresh. When a search comes in, first check the cache. If an entry is there, use it; otherwise, do the normal search.
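A rough sketch of that cache; the `~~cache-[query]` key scheme and the `search` helper are assumptions, and the user-refresh invalidation is left out:

```js
const CACHE_TTL_SECONDS = 48 * 60 * 60;   // 48-hour expiry

// Check a per-query cache before running the real search.
async function cachedSearch(redis, query) {
  const cacheKey = `~~cache-${query}`;              // assumed key scheme, not from this document
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  const results = await search(redis, query);       // the normal search path described above
  await redis.set(cacheKey, JSON.stringify(results), "EX", CACHE_TTL_SECONDS);
  return results;
}
```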
If a scored term has more than 10000 results, then we rescore it without the solution index. If it still has more than 10000 results with only the keywords and problem, then we rescore it with only the keywords. If it still has more than 10000 results, then we keep it and figure out other ways of getting a speed-up.
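One way that fallback could be wired up, assuming the 10000 threshold is checked with ZCARD after each pass:

```js
// Re-score a token with progressively fewer index sets while it stays above 10000 members.
async function scoreTokenWithFallback(redis, token) {
  const dest = `${token}-scores`;
  const passes = [
    { keys: [`${token}-keywords`, `${token}-problems`, `${token}-solutions`], weights: [10, 3, 1] },
    { keys: [`${token}-keywords`, `${token}-problems`], weights: [10, 3] },   // drop the solution index
    { keys: [`${token}-keywords`], weights: [10] },                           // keywords only
  ];
  for (const { keys, weights } of passes) {
    await redis.zunionstore(dest, keys.length, ...keys, "WEIGHTS", ...weights);
    if ((await redis.zcard(dest)) <= 10000) break;  // small enough; keep this pass
  }
}
```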
After scoring the results, get the top 1000 results and store them in a new Redis set, `[token]-scores-ranked`, and search on this limited set instead of the entire `[token]-scores`. This would give an upper bound on S, and therefore also an upper bound on R: 1000 times the number of search tokens. I would guess this would be enough for Redis to be quite performant.
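A sketch of building that capped set; copying via ZUNIONSTORE with a single source and trimming with ZREMRANGEBYRANK are assumptions about the mechanics, not something this document specifies:

```js
// Copy [token]-scores into [token]-scores-ranked and keep only the 1000 highest-scoring members.
async function buildRankedSet(redis, token) {
  await redis.zunionstore(`${token}-scores-ranked`, 1, `${token}-scores`);  // copy the full score set
  await redis.zremrangebyrank(`${token}-scores-ranked`, 0, -1001);          // drop everything below the top 1000
}
```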
One day we really should account for common spelling mistakes. Each term would go to a mapping table, and if it is a misspelled word, translate it and keep track of the translation, alerting the user.
Tokenization:
- Lowercase everything (including keywords)
- Turn unneeded characters into whitespace
  - `replace(/[^\s\da-z]|(\s)/g, " ")` - built from the list of allowed characters
- Get rid of unneeded words
  - `replace(/\b(the)\b|\b(and)\b|\b(is)\b|\b(to)\b|\b(by)\b|\b(in)\b|\b(with)\b/g, "")` - built from the list of ignored words
- Get rid of unneeded whitespace
  - `replace(/\s+/g, " ")`
- End up with just tokens separated by spaces
  - `split(" ")`
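Assembled from the regexes above, the whole tokenizer is roughly this (the trailing `trim`/`filter` are small additions to avoid empty tokens):

```js
// Normalize a query or snippet field into search tokens.
function tokenize(text) {
  return text
    .toLowerCase()                                                  // lowercase everything
    .replace(/[^\s\da-z]|(\s)/g, " ")                               // unneeded characters -> whitespace
    .replace(/\b(the)\b|\b(and)\b|\b(is)\b|\b(to)\b|\b(by)\b|\b(in)\b|\b(with)\b/g, "") // ignored words
    .replace(/\s+/g, " ")                                           // collapse whitespace
    .trim()
    .split(" ")
    .filter(Boolean);                                               // drop empty tokens
}
```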
Indexing:
- Input is the ID of the snippet and an object with the problem tokens, solution tokens, and keywords
- Indexing is simply adding the ID to each of the respective Redis sets
  - problemTokens => `[token]-problems`
  - solutionTokens => `[token]-solutions`
  - keywords => `[token]-keywords`
- Unindexing is just removing the ID from these sets
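As a sketch, with the same ioredis-style client and the field names listed above:

```js
// Index a snippet: add its ID to the per-token index set for each field.
async function indexSnippet(redis, id, { problemTokens, solutionTokens, keywords }) {
  for (const t of problemTokens) await redis.sadd(`${t}-problems`, id);
  for (const t of solutionTokens) await redis.sadd(`${t}-solutions`, id);
  for (const t of keywords) await redis.sadd(`${t}-keywords`, id);
}

// Unindexing is the same walk with SREM instead of SADD.
async function unindexSnippet(redis, id, { problemTokens, solutionTokens, keywords }) {
  for (const t of problemTokens) await redis.srem(`${t}-problems`, id);
  for (const t of solutionTokens) await redis.srem(`${t}-solutions`, id);
  for (const t of keywords) await redis.srem(`${t}-keywords`, id);
}
```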
Scoring:
- Takes an object that is the same as the indexing process object
- For every single token in the object, run this command:
  - `ZUNIONSTORE [token]-scores 3 [token]-keywords [token]-problems [token]-solutions WEIGHTS 10 3 1`
- This combines the three index sets created above into one weighted set that can quickly be searched
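A sketch of that scoring pass; plain SET members count as score 1 inside ZUNIONSTORE, so the 10/3/1 weights become the per-occurrence scores:

```js
// Re-score every token that appears anywhere in the snippet object.
async function scoreSnippet(redis, { problemTokens, solutionTokens, keywords }) {
  const tokens = new Set([...problemTokens, ...solutionTokens, ...keywords]);
  for (const token of tokens) {
    await redis.zunionstore(
      `${token}-scores`, 3,
      `${token}-keywords`, `${token}-problems`, `${token}-solutions`,
      "WEIGHTS", 10, 3, 1
    );
  }
}
```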