Algorithm Details

Add a snippet

  • Receive snippet data
  • Increment and get the value of ~~counter
  • Add the snippet data to Redis
  • Index and score the snippet
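
A minimal sketch of these steps, assuming a Node.js service using the ioredis client (the client choice and the helper names are assumptions for illustration; indexSnippet and scoreSnippet are sketched under "Indexing process" and "Scoring process" below):

```js
const Redis = require("ioredis");
const redis = new Redis();

// Hypothetical helper, not the actual implementation.
async function addSnippet(snippetData) {
  // Increment and read the global counter to get a new snippet ID
  const id = await redis.incr("~~counter");

  // Store the snippet data as stringified JSON under its ID
  await redis.set(id, JSON.stringify(snippetData));

  // Index and score the snippet so it becomes searchable
  await indexSnippet(id, snippetData);
  await scoreSnippet(snippetData);

  return id;
}
```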

Drop a snippet

  • Unindex the snippet, then re-run scoring so the derived score sets no longer include it
  • Add the snippet to the ~~recently-deleted sorted set with the score being the UTC timestamp
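
A sketch of the drop path, reusing the ioredis client and the helpers from the other sketches on this page (all names are illustrative):

```js
// Hypothetical helper, not the actual implementation.
async function dropSnippet(id, snippetData) {
  // Remove the snippet from the index sets, then re-run scoring so the
  // derived [token]-scores sets no longer include it
  await unindexSnippet(id, snippetData);
  await scoreSnippet(snippetData);

  // Record the deletion time (UTC, in ms) so it can be viewed/restored for 3 days
  await redis.zadd("~~recently-deleted", Date.now(), id);
}
```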

View dropped snippets

  • Clean the ~~recently-deleted set by removing any members older than 3 days
    • Note that this leaves the underlying Redis string that holds the snippet data in place
  • Do a ZREVRANGEBYSCORE on the ~~recently-deleted sorted set, from +inf down to the timestamp of 3 days ago
  • This returns a list of snippet IDs, so fetch the snippet data for each ID afterwards
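
A sketch of this read path, same assumptions as the sketches above:

```js
const THREE_DAYS = 3 * 24 * 60 * 60 * 1000; // in ms, to match Date.now() scores

// Hypothetical helper, not the actual implementation.
async function viewDroppedSnippets() {
  const cutoff = Date.now() - THREE_DAYS;

  // Clean: remove members older than 3 days (their snippet strings remain)
  await redis.zremrangebyscore("~~recently-deleted", "-inf", cutoff);

  // Newest deletions first, back to the 3-day cutoff
  const ids = await redis.zrevrangebyscore("~~recently-deleted", "+inf", cutoff);

  // The sorted set holds IDs, so fetch the snippet JSON for each one
  const snippets = await Promise.all(ids.map((id) => redis.get(id)));
  return snippets.map((json) => JSON.parse(json));
}
```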

Restore a snippet

  • Clean the ~~recently-deleted set
  • Remove the snippet from ~~recently-deleted
  • Index and score the snippet
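
A sketch of restoring, same assumptions as above (and assuming the stored JSON carries the token lists the indexer needs):

```js
// Hypothetical helper, not the actual implementation.
async function restoreSnippet(id) {
  // Clean the set, then take this snippet out of it
  await redis.zremrangebyscore("~~recently-deleted", "-inf", Date.now() - THREE_DAYS);
  await redis.zrem("~~recently-deleted", id);

  // Make it searchable again
  const snippetData = JSON.parse(await redis.get(id));
  await indexSnippet(id, snippetData);
  await scoreSnippet(snippetData);
}
```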

Destroy a snippet

  • Clean the ~~recently-deleted set
  • Remove the snippet from ~~recently-deleted
  • Delete the snippet data
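
A sketch of destroying, same assumptions as above:

```js
// Hypothetical helper, not the actual implementation.
async function destroySnippet(id) {
  // Clean the set, then take this snippet out of it
  await redis.zremrangebyscore("~~recently-deleted", "-inf", Date.now() - THREE_DAYS);
  await redis.zrem("~~recently-deleted", id);

  // This time the underlying Redis string holding the snippet data is deleted as well
  await redis.del(id);
}
```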

Edit a snippet

  • Delete the snippet
  • Re-add the snippet

This could clearly be optimized, but this is how it works for now.
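
One possible reading of "delete, then re-add", reusing the helpers sketched above (again, illustrative only):

```js
// Hypothetical helper, not the actual implementation.
async function editSnippet(id, oldData, newData) {
  await dropSnippet(id, oldData);       // unindex and re-score the old tokens
  await destroySnippet(id);             // remove it from ~~recently-deleted and delete its data
  return await addSnippet(newData);     // re-add it under a freshly incremented ID
}
```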

Search a snippet

The goal is for no search to ever take more than 100ms (except for stress tests).

When a query is received, it is tokenized using the process described below. Each resulting token should already have a pre-calculated sorted set of snippet IDs and scores, so we simply combine all of those sets. If the search query had two tokens, the command would look like this: ZUNIONSTORE ~~results 2 [first token]-scores [second token]-scores.

  • Time complexity: O(S) + O(R log(R)) with S being the sum of the sizes of the input token score sorted sets, and R being the number of elements in the resulting ~~results sorted set.

Get the resulting snippet indices with this command: ZREVRANGEBYSCORE ~~results +inf 1 WITHSCORES LIMIT 0 25.

  • Time complexity: O(log(R)) with R being the number of elements in the ~~results sorted set.

Finally, go through each snippet and get the stringified JSON data with a simple GET [snippet index].

  • Time complexity: O(1), because the LIMIT on the number of snippets to GET is 25.
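
Putting the three steps together, a sketch of the whole search path (same ioredis assumption; tokenize is sketched under "Tokenizing process" below):

```js
// Hypothetical helper, not the actual implementation.
async function search(query) {
  const tokens = tokenize(query);
  const scoreKeys = tokens.map((t) => `${t}-scores`);

  // Step 1: union the per-token score sets into ~~results
  await redis.zunionstore("~~results", scoreKeys.length, ...scoreKeys);

  // Step 2: take the top 25 snippet IDs by combined score
  const hits = await redis.zrevrangebyscore(
    "~~results", "+inf", 1, "WITHSCORES", "LIMIT", 0, 25
  );

  // Step 3: hits alternates [id, score, id, score, ...]; GET each snippet's JSON
  const results = [];
  for (let i = 0; i < hits.length; i += 2) {
    const json = await redis.get(hits[i]);
    results.push({ id: hits[i], score: Number(hits[i + 1]), data: JSON.parse(json) });
  }
  return results;
}
```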

Total time complexity

So, the total complexity of the algorithm is O(S) + O(R log(R)) + O(log(R)) + O(1) = O(S) + O((R+1) log(R)) = O(S) + O(R log(R)), which is exactly the same complexity as the first step. So, that first step of the search algorithm is the bottleneck.

Is this a restrictive bottleneck? It definitely could be, yes. If the user searches for two terms that are each used thousands of times but never together, then R = S, so the complexity is O(S log(S)). This is not a horrible result, but on enterprise-scale systems where some terms can be used millions of times, it would be unworkable. Remember that CheatSheet indexes the solution text for terms, so if a very common word like "of" is not removed from the indexing process and it gets searched for, it could definitely be an issue.

Possible methods of getting speed-up

Some sort of caching could mitigate this, potentially. For every search query, create a new Redis key that contains the results and that expires in 48 hours, or upon a user refresh. When a search comes in, first check the cache. If the results are there, use them. Otherwise, do the normal search.
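
A sketch of that cache, with an assumed key naming scheme (a user refresh could simply delete these keys):

```js
// Hypothetical helper, not the actual implementation.
async function cachedSearch(query) {
  const cacheKey = `~~cache-${query.toLowerCase()}`; // assumed key scheme

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached); // cache hit

  const results = await search(query);   // normal search path (sketched above)
  await redis.set(cacheKey, JSON.stringify(results), "EX", 48 * 60 * 60); // expire in 48h
  return results;
}
```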

If a scored term has more than 10,000 results, we rescore it without the solution index. If it still has more than 10,000 results with only the keywords and problem tokens, we rescore with only the keywords. If it still has more than 10,000 results after that, we keep it and look for other ways of getting speed-up.

After scoring, take the top 1000 results for each token and store them in a new Redis sorted set, [token]-scores-ranked, and search on this limited set instead of the entire [token]-scores. This would give an upper bound on S, and therefore also an upper bound on R: 1000 * the number of search tokens. I would guess this would be enough for Redis to be quite performant.
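
One way to build that capped set, assuming ZREMRANGEBYRANK is used to trim it (illustrative only):

```js
// Hypothetical helper, not the actual implementation.
async function buildRankedSet(token) {
  // Copy the full score set, then keep only the 1000 highest-scored members
  await redis.zunionstore(`${token}-scores-ranked`, 1, `${token}-scores`);
  await redis.zremrangebyrank(`${token}-scores-ranked`, 0, -1001);
}
```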

Future features

One day we really should account for common spelling mistakes. Each term would go through a mapping table; if it is a misspelled word, translate it, keep track of the translation, and alert the user.

Tokenizing process

  • Lowercase everything (including keywords)
  • Turn unneeded characters into whitespace
    • replace(/[^\s\da-z]|(\s)/g, " ") - from list of allowed characters
  • Get rid of unneeded words
    • replace(/\b(the)\b|\b(and)\b|\b(is)\b|\b(to)\b|\b(by)\b|\b(in)\b|\b(with)\b/g, "") - built from list of ignored words
  • Get rid of unneeded whitespace
    • replace(/\s+/g, " ")
  • End up with just tokens separated by spaces
    • split(" ")
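
The steps above combined into one function (a trim and a filter are added here so leading/trailing spaces don't produce empty tokens; the stop-word list is just the one shown, which may not be the full list used in the codebase):

```js
// Hypothetical helper, not the actual implementation.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\s\da-z]|(\s)/g, " ")  // unneeded characters -> whitespace
    .replace(/\b(the)\b|\b(and)\b|\b(is)\b|\b(to)\b|\b(by)\b|\b(in)\b|\b(with)\b/g, "")
    .replace(/\s+/g, " ")              // collapse unneeded whitespace
    .trim()
    .split(" ")
    .filter((token) => token.length > 0);
}
```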

Indexing process

  • Input is the ID of the snippet and an object with the problem tokens, solution tokens, and keywords
  • Indexing is simply adding the ID to each of the respective Redis sets
    • problemTokens => [token]-problems
    • solutionTokens => [token]-solutions
    • keywords => [token]-keywords
  • Unindexing is just removing the ID from these sets
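
A sketch of indexing and unindexing, following the mapping above (ioredis assumed, names illustrative):

```js
// Hypothetical helpers, not the actual implementation.
async function indexSnippet(id, { problemTokens, solutionTokens, keywords }) {
  for (const t of problemTokens)  await redis.sadd(`${t}-problems`, id);
  for (const t of solutionTokens) await redis.sadd(`${t}-solutions`, id);
  for (const t of keywords)       await redis.sadd(`${t}-keywords`, id);
}

async function unindexSnippet(id, { problemTokens, solutionTokens, keywords }) {
  for (const t of problemTokens)  await redis.srem(`${t}-problems`, id);
  for (const t of solutionTokens) await redis.srem(`${t}-solutions`, id);
  for (const t of keywords)       await redis.srem(`${t}-keywords`, id);
}
```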

Scoring process

  • Takes an object that is the same as the indexing process object
  • For every single token in the object, run this command:
    • ZUNIONSTORE [token]-scores 3 [token]-keywords [token]-problems [token]-solutions WEIGHTS 10 3 1
  • This combines the three index sets created above into one weighted set that can quickly be searched
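
A sketch of the scoring step, running the command above once per distinct token (ioredis assumed, names illustrative):

```js
// Hypothetical helper, not the actual implementation.
async function scoreSnippet({ problemTokens, solutionTokens, keywords }) {
  const tokens = new Set([...problemTokens, ...solutionTokens, ...keywords]);
  for (const t of tokens) {
    await redis.zunionstore(
      `${t}-scores`, 3,
      `${t}-keywords`, `${t}-problems`, `${t}-solutions`,
      "WEIGHTS", 10, 3, 1
    );
  }
}
```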