-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible sort issue for times approaching 1970 #10
Comments
Currently the epoch is constructed from an arbitrary point in time (much later than 1970--somewhere close in time to when the process starts, in most cases). In other words, we aren't using the UNIX epoch anymore. Adding padding doesn't sound like a bad idea if indeed the BigInt isn't already padded. I'm curious if there's a particular source that describes this behavior? |
Just looking at the 64 bit timestamp, currently it would result in 7 base-62 encoded characters. It won't be until 2082 it will jump to 8 characters long (from the current 7 characters): I haven't checked it out for the full bytebuffer at 128 bits to see when the next digit appears, but it is possible it is a ways out. So your call. If someone chooses this flake format and needs to generate past or future ids (for whatever reason) this issue increases. Your base encoding function generates ids based on number size only, so nothing is padded. |
Perhaps the caller should pad the input? It might be dangerous to assume the padding should be implicitly padded. What do you think? |
I think people will assume the encoded id maintains sort order, so I'd pad it. I'd let them not pad it if they don't want the sort order guarantee. The padding (assuming it is with zeros) won't affect decoding the ID back into the original bigint, so the integrity of the id is not affected. |
FWIW, here is a pad function I use... I'm sure there is an optimization in there to be had: (defn pad-left
"Pad a string s to a length of l with character c."
[s l c]
(-> (- l (count s))
(repeat c)
(concat [s])
(#(apply str %)))) |
What length is the correct length to pad to in your opinion? My understanding is BigIntegers don't have a MAX_VALUE and instead are backed by an array of ints and thus their size limitations are correlated with available memory. |
Your id is 128 bits, so 22 characters. (-> (Math/pow 2 128)
(dec)
(base62-encode)
(count))
;;=> 22 |
Ah right, of course! Long day. I suppose after whatever point the BigInteger exceeds that length the IDs no longer work, but in theory that's far enough into the future to not matter. |
Yes, very very far into the future, which is why I mentioned before you can safely add precision greater than millis to your first 64 bits, if you wanted and were OK with a change to your ids... but at this point seems you are making some changes anyhow. :) |
Right, I think increasing the precision is fine, since in theory it doesn't break anything. Moving around the bits on the other hand does. :) By the way, I think the current character count varies from your estimate: generating IDs locally and coercing them to BigIntegers yields character counts of 32. user=> (count (str (flake/flake->bigint (flake/generate!))))
32 |
That is the length of your integer, and you should have no issues with the integer sort order. When you base-62 encode the integer, it is the resulting string sort order that can become incorrect. Per your example, this should apply (I didn't test it though): (count (encode/base62-encode (flake/generate!))) |
Doesn't it defeat the purpose of encoding if we end up making the encoded ID longer? |
I guess what I'm not clear about here is how we lose lexicographic ordering if the ID becomes longer (or shorter) relative to the timestamp bits? I don't think I've understood your original report and possibly it's just too late in the day and I'm missing the obvious. :) |
I'm only referring to the base-62 encoded flake id, not the bigint which should sort fine. The smaller your bigint, the shorter the characters... following is an example. (base62-encode 10)
;;=> "a"
(base62-encode 100)
;;=> "1C"
(sort ["1C" "a"])
;;=> ("1C" "a") ;; wrong!!
(sort ["1C" "0a"])
;;=> ("0a" "1C") ;; right!! - padded with a "0" to be same length. |
Okay, so your original issue was that if a timestamp occurred around 1970 IDs, specifically BigIntegers, would be short in length and thus sort incorrectly. Makes sense and yes I agree that it's probably worth padding. :) |
When base-encoding the bigint:
Since the "left" 64 bits are using milliseconds since epoch, and are technically allowed to go up to a value of (Long/MAX_VALUE), and also go down to zero, the character count of the final encoded bigint value can be as many as 22, but also much lower (if I've interpreted the source correctly).
Assuming the goal is always ensuring proper sort order, the value needs to left-padded with "0"s to the max character length (22 if I'm not mistaken).
The text was updated successfully, but these errors were encountered: