Algorithm Details

Add a snippet

  • Receive snippet data
  • Increment and get the value of ~~counter
  • Add the snippet data to Redis
  • Index and score the snippet
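
A minimal sketch of these steps, assuming a Node.js service using the ioredis client (the client choice and the helper names are assumptions for illustration; indexSnippet and scoreSnippet are sketched under "Indexing process" and "Scoring process" below):

```js
const Redis = require("ioredis");
const redis = new Redis();

// Hypothetical helper, not the actual implementation.
async function addSnippet(snippetData) {
  // Increment and read the global counter to get a new snippet ID
  const id = await redis.incr("~~counter");

  // Store the snippet data as stringified JSON under its ID
  await redis.set(id, JSON.stringify(snippetData));

  // Index and score the snippet so it becomes searchable
  await indexSnippet(id, snippetData);
  await scoreSnippet(snippetData);

  return id;
}
```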

Drop a snippet

  • Unindex the snippet, then re-run scoring so the derived score sets no longer include it
  • Add the snippet to the ~~recently-deleted sorted set with the score being the UTC timestamp
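
A sketch of the drop path, reusing the ioredis client and the helpers from the other sketches on this page (all names are illustrative):

```js
// Hypothetical helper, not the actual implementation.
async function dropSnippet(id, snippetData) {
  // Remove the snippet from the index sets, then re-run scoring so the
  // derived [token]-scores sets no longer include it
  await unindexSnippet(id, snippetData);
  await scoreSnippet(snippetData);

  // Record the deletion time (UTC, in ms) so it can be viewed/restored for 3 days
  await redis.zadd("~~recently-deleted", Date.now(), id);
}
```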

View dropped snippets

  • Clean the ~~recently-deleted set by removing any members older than 3 days
    • Note that this leaves the underlying Redis string that holds the snippet data in place
  • Do a ZREVRANGEBYSCORE on the ~~recently-deleted sorted set, from +inf down to the timestamp of 3 days ago
  • This returns a list of snippet IDs, so fetch the snippet data for each ID afterwards
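
A sketch of this read path, same assumptions as the sketches above:

```js
const THREE_DAYS = 3 * 24 * 60 * 60 * 1000; // in ms, to match Date.now() scores

// Hypothetical helper, not the actual implementation.
async function viewDroppedSnippets() {
  const cutoff = Date.now() - THREE_DAYS;

  // Clean: remove members older than 3 days (their snippet strings remain)
  await redis.zremrangebyscore("~~recently-deleted", "-inf", cutoff);

  // Newest deletions first, back to the 3-day cutoff
  const ids = await redis.zrevrangebyscore("~~recently-deleted", "+inf", cutoff);

  // The sorted set holds IDs, so fetch the snippet JSON for each one
  const snippets = await Promise.all(ids.map((id) => redis.get(id)));
  return snippets.map((json) => JSON.parse(json));
}
```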

Restore a snippet

  • Clean the ~~recently-deleted set
  • Remove the snippet from ~~recently-deleted
  • Index and score the snippet
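
A sketch of restoring, same assumptions as above (and assuming the stored JSON carries the token lists the indexer needs):

```js
// Hypothetical helper, not the actual implementation.
async function restoreSnippet(id) {
  // Clean the set, then take this snippet out of it
  await redis.zremrangebyscore("~~recently-deleted", "-inf", Date.now() - THREE_DAYS);
  await redis.zrem("~~recently-deleted", id);

  // Make it searchable again
  const snippetData = JSON.parse(await redis.get(id));
  await indexSnippet(id, snippetData);
  await scoreSnippet(snippetData);
}
```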

Destroy a snippet

  • Clean the ~~recently-deleted set
  • Remove the snippet from ~~recently-deleted
  • Delete the snippet data
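
A sketch of destroying, same assumptions as above:

```js
// Hypothetical helper, not the actual implementation.
async function destroySnippet(id) {
  // Clean the set, then take this snippet out of it
  await redis.zremrangebyscore("~~recently-deleted", "-inf", Date.now() - THREE_DAYS);
  await redis.zrem("~~recently-deleted", id);

  // This time the underlying Redis string holding the snippet data is deleted as well
  await redis.del(id);
}
```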

Edit a snippet

  • Delete the snippet
  • Re-add the snippet

This could clearly be optimized, but this is how it works for now.
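
One possible reading of "delete, then re-add", reusing the helpers sketched above (again, illustrative only):

```js
// Hypothetical helper, not the actual implementation.
async function editSnippet(id, oldData, newData) {
  await dropSnippet(id, oldData);       // unindex and re-score the old tokens
  await destroySnippet(id);             // remove it from ~~recently-deleted and delete its data
  return await addSnippet(newData);     // re-add it under a freshly incremented ID
}
```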

Search a snippet

The goal is for no search to ever take more than 100ms (except for stress tests).

When a query is received, it is tokenized using the process described below. Each resulting token should already have a pre-calculated sorted set of snippet IDs and scores, so we simply combine all of those sets. If the search query had two tokens, the command would look like this: ZUNIONSTORE ~~results 2 [first token]-scores [second token]-scores.

  • Time complexity: O(S) + O(R log(R)) with S being the sum of the sizes of the input token score sorted sets, and R being the number of elements in the resulting ~~results sorted set.

Get the resulting snippet indices with this command: ZREVRANGEBYSCORE ~~results +inf 1 WITHSCORES LIMIT 0 25.

  • Time complexity: O(log(R)) with R being the number of elements in the ~~results sorted set.

Finally, go through each snippet and get the stringified JSON data with a simple GET [snippet index].

  • Time complexity: O(1), because the LIMIT on the number of snippets to GET is 25.
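
Putting the three steps together, a sketch of the whole search path (same ioredis assumption; tokenize is sketched under "Tokenizing process" below):

```js
// Hypothetical helper, not the actual implementation.
async function search(query) {
  const tokens = tokenize(query);
  const scoreKeys = tokens.map((t) => `${t}-scores`);

  // Step 1: union the per-token score sets into ~~results
  await redis.zunionstore("~~results", scoreKeys.length, ...scoreKeys);

  // Step 2: take the top 25 snippet IDs by combined score
  const hits = await redis.zrevrangebyscore(
    "~~results", "+inf", 1, "WITHSCORES", "LIMIT", 0, 25
  );

  // Step 3: hits alternates [id, score, id, score, ...]; GET each snippet's JSON
  const results = [];
  for (let i = 0; i < hits.length; i += 2) {
    const json = await redis.get(hits[i]);
    results.push({ id: hits[i], score: Number(hits[i + 1]), data: JSON.parse(json) });
  }
  return results;
}
```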

Total time complexity

So, the total complexity of the algorithm is O(S) + O(R log(R)) + O(log(R)) + O(1) = O(S) + O((R+1) log(R)) = O(S) + O(R log(R)), which is exactly the same complexity as the first step. So, that first step of the search algorithm is the bottleneck.

Is this a restrictive bottleneck? It definitely could be, yes. If the user searches for two terms that are each used thousands of times but never together, then R = S, so the complexity is O(S log(S)). This is not a horrible result, but on enterprise-scale systems where some terms can be used millions of times, it would be unworkable. Remember that CheatSheet indexes the solution text for terms, so if a very common word like "of" is not removed from the indexing process and it gets searched for, it could definitely be an issue.

Possible methods of getting speed-up

Some sort of caching could mitigate this, potentially. For every search query, create a new Redis key that contains the results and that expires in 48 hours, or upon a user refresh. When a search comes in, first check the cache. If the results are there, use them. Otherwise, do the normal search.
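
A sketch of that cache, with an assumed key naming scheme (a user refresh could simply delete these keys):

```js
// Hypothetical helper, not the actual implementation.
async function cachedSearch(query) {
  const cacheKey = `~~cache-${query.toLowerCase()}`; // assumed key scheme

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached); // cache hit

  const results = await search(query);   // normal search path (sketched above)
  await redis.set(cacheKey, JSON.stringify(results), "EX", 48 * 60 * 60); // expire in 48h
  return results;
}
```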

If a scored term has more than 10,000 results, we rescore it without the solution index. If it still has more than 10,000 results with only the keywords and problem tokens, we rescore with only the keywords. If it still has more than 10,000 results after that, we keep it and look for other ways of getting speed-up.

After scoring, take the top 1000 results for each token and store them in a new Redis sorted set, [token]-scores-ranked, and search on this limited set instead of the entire [token]-scores. This would give an upper bound on S, and therefore also an upper bound on R: 1000 * the number of search tokens. I would guess this would be enough for Redis to be quite performant.
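
One way to build that capped set, assuming ZREMRANGEBYRANK is used to trim it (illustrative only):

```js
// Hypothetical helper, not the actual implementation.
async function buildRankedSet(token) {
  // Copy the full score set, then keep only the 1000 highest-scored members
  await redis.zunionstore(`${token}-scores-ranked`, 1, `${token}-scores`);
  await redis.zremrangebyrank(`${token}-scores-ranked`, 0, -1001);
}
```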

Future features

One day we really should account for common spelling mistakes. Each term would go through a mapping table; if it is a misspelled word, translate it, keep track of the translation, and alert the user.

Tokenizing process

  • Lowercase everything (including keywords)
  • Turn unneeded characters into whitespace
    • replace(/[^\s\da-z]|(\s)/g, " ") - from list of allowed characters
  • Get rid of unneeded words
    • replace(/\b(the)\b|\b(and)\b|\b(is)\b|\b(to)\b|\b(by)\b|\b(in)\b|\b(with)\b/g, "") - built from list of ignored words
  • Get rid of unneeded whitespace
    • replace(/\s+/g, " ")
  • End up with just tokens separated by spaces
    • split(" ")
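
The steps above combined into one function (a trim and a filter are added here so leading/trailing spaces don't produce empty tokens; the stop-word list is just the one shown, which may not be the full list used in the codebase):

```js
// Hypothetical helper, not the actual implementation.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\s\da-z]|(\s)/g, " ")  // unneeded characters -> whitespace
    .replace(/\b(the)\b|\b(and)\b|\b(is)\b|\b(to)\b|\b(by)\b|\b(in)\b|\b(with)\b/g, "")
    .replace(/\s+/g, " ")              // collapse unneeded whitespace
    .trim()
    .split(" ")
    .filter((token) => token.length > 0);
}
```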

Indexing process

  • Input is the ID of the snippet and an object with the problem tokens, solution tokens, and keywords
  • Indexing is simply adding the ID to each of the respective Redis sets
    • problemTokens => [token]-problems
    • solutionTokens => [token]-solutions
    • keywords => [token]-keywords
  • Unindexing is just removing the ID from these sets
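
A sketch of indexing and unindexing, following the mapping above (ioredis assumed, names illustrative):

```js
// Hypothetical helpers, not the actual implementation.
async function indexSnippet(id, { problemTokens, solutionTokens, keywords }) {
  for (const t of problemTokens)  await redis.sadd(`${t}-problems`, id);
  for (const t of solutionTokens) await redis.sadd(`${t}-solutions`, id);
  for (const t of keywords)       await redis.sadd(`${t}-keywords`, id);
}

async function unindexSnippet(id, { problemTokens, solutionTokens, keywords }) {
  for (const t of problemTokens)  await redis.srem(`${t}-problems`, id);
  for (const t of solutionTokens) await redis.srem(`${t}-solutions`, id);
  for (const t of keywords)       await redis.srem(`${t}-keywords`, id);
}
```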

Scoring process

  • Takes an object that is the same as the indexing process object
  • For every single token in the object, run this command:
    • ZUNIONSTORE [token]-scores 3 [token]-keywords [token]-problems [token]-solutions WEIGHTS 10 3 1
  • This combines the three index sets created above into one weighted set that can quickly be searched
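
A sketch of the scoring step, running the command above once per distinct token (ioredis assumed, names illustrative):

```js
// Hypothetical helper, not the actual implementation.
async function scoreSnippet({ problemTokens, solutionTokens, keywords }) {
  const tokens = new Set([...problemTokens, ...solutionTokens, ...keywords]);
  for (const t of tokens) {
    await redis.zunionstore(
      `${t}-scores`, 3,
      `${t}-keywords`, `${t}-problems`, `${t}-solutions`,
      "WEIGHTS", 10, 3, 1
    );
  }
}
```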