Possible sort issue for times approaching 1970 #10

bplatz · 2016-06-22T00:57:39Z

When base-encoding the bigint:

Since the "left" 64 bits are using milliseconds since epoch, and are technically allowed to go up to a value of (Long/MAX_VALUE), and also go down to zero, the character count of the final encoded bigint value can be as many as 22, but also much lower (if I've interpreted the source correctly).

Assuming the goal is always ensuring proper sort order, the value needs to left-padded with "0"s to the max character length (22 if I'm not mistaken).

maxcountryman · 2016-06-22T01:02:33Z

Currently the epoch is constructed from an arbitrary point in time (much later than 1970--somewhere close in time to when the process starts, in most cases). In other words, we aren't using the UNIX epoch anymore.

Adding padding doesn't sound like a bad idea if indeed the BigInt isn't already padded. I'm curious if there's a particular source that describes this behavior?

bplatz · 2016-06-22T01:15:33Z

Just looking at the 64 bit timestamp, currently it would result in 7 base-62 encoded characters.

It won't be until 2082 it will jump to 8 characters long (from the current 7 characters):
(-> (java.time.Instant/parse "2082-01-01T00:00:00Z")
(.toEpochMilli)
(base62-encode)
count)
;;=> 8

I haven't checked it out for the full bytebuffer at 128 bits to see when the next digit appears, but it is possible it is a ways out. So your call.

If someone chooses this flake format and needs to generate past or future ids (for whatever reason) this issue increases.

Your base encoding function generates ids based on number size only, so nothing is padded.

maxcountryman · 2016-06-22T01:22:35Z

Perhaps the caller should pad the input? It might be dangerous to assume the padding should be implicitly padded. What do you think?

bplatz · 2016-06-22T01:37:21Z

I think people will assume the encoded id maintains sort order, so I'd pad it.

I'd let them not pad it if they don't want the sort order guarantee.

The padding (assuming it is with zeros) won't affect decoding the ID back into the original bigint, so the integrity of the id is not affected.

bplatz · 2016-06-22T01:41:41Z

FWIW, here is a pad function I use... I'm sure there is an optimization in there to be had:

(defn pad-left
  "Pad a string s to a length of l with character c."
  [s l c]
  (-> (- l (count s))
      (repeat c)
      (concat [s])
      (#(apply str %))))

maxcountryman · 2016-06-22T01:46:26Z

What length is the correct length to pad to in your opinion? My understanding is BigIntegers don't have a MAX_VALUE and instead are backed by an array of ints and thus their size limitations are correlated with available memory.

bplatz · 2016-06-22T01:50:00Z

Your id is 128 bits, so 22 characters.

(-> (Math/pow 2 128)
    (dec)
    (base62-encode)
    (count))
;;=> 22

maxcountryman · 2016-06-22T01:50:57Z

Ah right, of course! Long day. I suppose after whatever point the BigInteger exceeds that length the IDs no longer work, but in theory that's far enough into the future to not matter.

bplatz · 2016-06-22T01:52:46Z

Yes, very very far into the future, which is why I mentioned before you can safely add precision greater than millis to your first 64 bits, if you wanted and were OK with a change to your ids... but at this point seems you are making some changes anyhow. :)

maxcountryman · 2016-06-22T02:05:31Z

Right, I think increasing the precision is fine, since in theory it doesn't break anything. Moving around the bits on the other hand does. :)

By the way, I think the current character count varies from your estimate: generating IDs locally and coercing them to BigIntegers yields character counts of 32.

user=> (count (str (flake/flake->bigint (flake/generate!))))
32

bplatz · 2016-06-22T02:10:12Z

That is the length of your integer, and you should have no issues with the integer sort order.

When you base-62 encode the integer, it is the resulting string sort order that can become incorrect.

Per your example, this should apply (I didn't test it though):

(count (encode/base62-encode (flake/generate!)))

maxcountryman · 2016-06-22T02:11:08Z

Doesn't it defeat the purpose of encoding if we end up making the encoded ID longer?

maxcountryman · 2016-06-22T02:13:16Z

I guess what I'm not clear about here is how we lose lexicographic ordering if the ID becomes longer (or shorter) relative to the timestamp bits?

I don't think I've understood your original report and possibly it's just too late in the day and I'm missing the obvious. :)

bplatz · 2016-06-22T02:21:17Z

I'm only referring to the base-62 encoded flake id, not the bigint which should sort fine.

The smaller your bigint, the shorter the characters... following is an example.

(base62-encode 10)
;;=> "a"

(base62-encode 100)
;;=> "1C"

(sort ["1C" "a"])
;;=> ("1C" "a") ;;  wrong!!

(sort ["1C" "0a"])
;;=> ("0a" "1C") ;;  right!! - padded with a "0" to be same length.

maxcountryman · 2016-06-22T02:25:37Z

Okay, so your original issue was that if a timestamp occurred around 1970 IDs, specifically BigIntegers, would be short in length and thus sort incorrectly.

Makes sense and yes I agree that it's probably worth padding. :)

xsc mentioned this issue Jul 9, 2016

update flake xsc/kithara#7

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible sort issue for times approaching 1970 #10

Possible sort issue for times approaching 1970 #10

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016 •

edited

Loading

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016 •

edited

Loading

maxcountryman commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

Possible sort issue for times approaching 1970 #10

Possible sort issue for times approaching 1970 #10

Comments

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016 • edited Loading

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016 • edited Loading

maxcountryman commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016

maxcountryman commented Jun 22, 2016

bplatz commented Jun 22, 2016 •

edited

Loading

bplatz commented Jun 22, 2016 •

edited

Loading