Discussion: Unix Timestamp Granularity in UUIDv7 #23
I vote to leave it at 36 in the current implementation; I would rather future-proof as much as possible with a larger number than reduce the size and truncate a 64-bit timestamp more. That being said: depending on how we choose to encode
Yes, the 36-bit length was definitely done intentionally in order to avoid running out of numbers - but I do agree it's annoying to use. Another option would be to use a 32-bit unsigned integer, which runs out in the year 2106. I'm not super excited about an end date that is close enough for my grandchildren to curse me when it runs out on them, but it's an option and would definitely be simpler to use. I'll write more on this in #24 in a moment.
I think 32 bits, even if unsigned, are not enough. 2106 is not that far in the future. I can also see the benefit of keeping the value unsigned, because generating a UUID before 1970-01-01 shouldn't be a real-life use case. I definitely want to avoid introducing a new epoch. The Unix epoch makes implementations much easier. I don't even know why they decided to use the Gregorian epoch in the case of UUIDv1. For me, anything from 2514 onwards looks good.
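For reference, a quick check (my addition, not from the thread) of exactly where a 32-bit unsigned seconds-since-epoch counter ends:

```python
from datetime import datetime, timezone

# Last instant representable by a 32-bit unsigned count of Unix seconds.
print(datetime.fromtimestamp(2**32 - 1, tz=timezone.utc))
# -> 2106-02-07 06:28:15+00:00
```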
@nerg4l @kyzer-davis If we did the 111 var field from #26 for UUID7, that gives us the first 64 bits clear and free of any other requirements. What if we, despite how clever I thought I was being with this subsecond encoding stuff (I still like the idea, but I don't like how much confusion it has created), instead just use a 64-bit nanosecond-precise Unix timestamp? I'm not sure if this should replace UUIDv7 or maybe be a separate version (we could do UUID9 ;) ). But regardless, I'm just trying to see if there is a way to simplify this whole thing down so when people read the spec it's like "oh of course, that makes sense" instead of "what the..." So the new bit layout would be:
Where …

And then we basically just say: "implementations can parse out the time if they like, but the other fields are essentially write-only and are a recommendation - don't try to read these, because it's not invalid for implementations to make changes after the ver-var field." And frankly, if implementations wanted to encode random stuff into the least significant bytes of the nanosecond-precise timestamp, I don't think that should even be explicitly forbidden - it's just "if you do this, other implementations may get the wrong timestamp - use at your own risk". I think this would be dramatically simpler and easier to understand.
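A minimal sketch of what generation could look like under this proposal, assuming the full first 64 bits hold the nanosecond timestamp and glossing over exactly where the ver/var bits would land (`uuid_ns_sketch` is a hypothetical name, not anything from the draft):

```python
import secrets
import time
import uuid

def uuid_ns_sketch() -> uuid.UUID:
    # First 64 bits: unsigned nanosecond-precision Unix timestamp.
    ts_ns = time.time_ns() & (2**64 - 1)
    # Remaining 64 bits: effectively write-only (would carry the ver/var
    # bits plus random filler in the actual proposal).
    rest = secrets.randbits(64)
    return uuid.UUID(int=(ts_ns << 64) | rest)

print(uuid_ns_sketch())
```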
@bradleypeabody I think the variable subsecond handling in UUIDv7 was added because not all languages can return a Unix timestamp with nanosecond precision, e.g. JavaScript, PHP. Changing the position of the version bits would affect some widely used libraries. Here are some examples:
Edit: It is always quite unfortunate when there are already existing implementations.
Another important thing to keep in mind is how many UUIDs can be generated per unit of time. The more of a type of UUID that can be generated in a given interval, the better. I did some simple math to get the maximum number of version 1 UUIDs that can be generated in a nanosecond interval. I treated all non-timestamp bits as random bits and obtained the value
Wouldn't it be
? BTW, your calculation is a bit wrong.
Yes. I made a silly mistake in the beginning. All the rest of the calculation is wrong! I will fix this. Thanks for noticing!
Now any UUIDv7 that we have talked about is better than UUIDv1. I use Python's
Note: in this simple (naive?) calculation, all bits that are not timestamps are considered random.
Using the General Birthday Formula

The General Birthday Formula applied to the UUID types:
The General Birthday Formula:

```python
import math
from decimal import Decimal

def birthday_formula(t):
    return Decimal(math.sqrt(-2 * math.log(1 - 1/2)) * math.sqrt(t))

# Example on terminal
>>> uuid1 = Decimal(2**62) / Decimal(10**2)
>>> birthday_formula(uuid1)
Decimal('252846877.0343293845653533935546875')
```

The formula calculates the number of UUIDs which need to be generated in order to have a 50% probability of at least one collision. Sources:
Update (2021-08-15)
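As a further usage example (my addition, reusing `birthday_formula` and `Decimal` from the block above), the same formula applied to UUIDv4's 122 random bits, treated as one big interval:

```python
# Roughly 2.7e18 UUIDv4s before a 50% chance of at least one collision.
>>> birthday_formula(Decimal(2**122))  # ~Decimal('2.715E+18')
```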
IMO the random part should be as big as possible so keeping 4 leading zero bits is a weird idea.
Agree. The next option is 33, but the odd number of bits is a bit odd, lol.
I like it.
Is OK too.
I played around with the 40-bit timestamps suggested by Brad in another issue. If we use 40 bits with 10-millisecond (centisecond) precision:
Layout:
Let me know what you think and if I made another mistake. PS: the layout shown here is based on ULID with sequence, presented by @sergeyprokhorenko in issue #27.
LGTM
I don't like this idea. This allows you to predict the next UUID value, which could be a security issue.
I think that human response time is an outdated benchmark. We'd better think about data streams from many IoT sources. Therefore the more unique the timestamp + sequence values are, the better, because incoming UUIDs are more ordered. In most cases we can consider the clock sequence empty or null. So the more different timestamps per second, the better. Therefore we'd better take a timestamp precision no finer than a round-trip network request within the same data centre (500 microseconds). But we have to pay for high precision by increasing the bit depth of the timestamp. So I think 1 millisecond timestamp precision is enough.
Yes. It's important to keep IoT in mind. I updated my comment #23 (comment) to include two new timestamp alternatives:
Also included a table with values generated using the General Birthday Formula. All bits that are not part of the timestamp are considered random in that calculation.
Unsure that your calculation is correct. IMO it is better to calculate something like the annual collision probability when a new UUID is generated every nanosecond. I use this formula from the wiki to calculate the probability per tick:

For UUIDv7-40b-10msec

The annual (

For UUID v4 it will be one big tick,
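A sketch of this approach in Python (my code, not @edo1's exact spreadsheet inputs): approximate the per-tick collision probability with the usual birthday bound, then compound it over a year's worth of ticks.

```python
import math

def annual_collision_probability(random_bits, rate_per_sec, tick_seconds):
    d = 2.0 ** random_bits            # size of the random space per tick
    n = rate_per_sec * tick_seconds   # UUIDs generated within one tick
    p_tick = 1.0 - math.exp(-n * (n - 1) / (2.0 * d))  # birthday bound
    ticks_per_year = 365.25 * 86400 / tick_seconds
    return 1.0 - (1.0 - p_tick) ** ticks_per_year

# e.g. 73 random bits, 10**9 UUIDs/s, 1 ms ticks -> ~0.81, in line with
# the "more than 80%" figure that comes up later in this thread.
print(annual_collision_probability(73, 10**9, 1e-3))
```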
Excellent, @edo1! Now we have a mathematical tool to help us evaluate all the options and trade-offs involved. I don't care that my calculations were wrong if they encouraged someone to do something a lot better.
Have made a spreadsheet with some UUID variants.
In all cases, all effective non-timestamp bits are considered random. The target is …

You could make a copy of the spreadsheet and play around with the constants/formulas.
Great, @edo1! We can adjust the life span of the UUIDv7 in the field "years after now", right? It's interesting how the number of years affects the probability of collision. But why don't we use 200 years? The life span of 120 shows that UUIDv7 with 39 bits and centisecond precision is the best option. But I prefer not using this number of bits, to prevent someone from storing this type of UUID in a SIGNED integer field. If that occurs, the "last date" becomes 2057, not 2144. If we adjust the life span to 200 years, the "best" option becomes 40 bits (5 whole bytes), and the last date becomes 2318 (unsigned) or 2144 (signed) (which is 123 years from now). Please change the "stolen bits" field of UUID variant 3 to 3. The bit sequence for this variant is '111'.
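The "last date" figures above are easy to re-derive (my sketch, assuming a count of centiseconds since the Unix epoch; a signed field effectively loses one bit):

```python
from datetime import datetime, timezone

def last_date(timestamp_bits):
    max_centiseconds = 2**timestamp_bits - 1
    return datetime.fromtimestamp(max_centiseconds / 100, tz=timezone.utc)

print(last_date(40))  # unsigned 40-bit           -> year 2318
print(last_date(39))  # signed 40-bit (39 usable) -> year 2144
print(last_date(38))  # signed 39-bit (38 usable) -> year 2057
```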
Actually, I do not exclude the option of making the epoch even shorter.
And I don't expect a very long lifespan for 128-bit UUIDs. It is very likely that in 50 or 100 years, 256-bit identifiers (or something radically new) will be used.
I think we could start a new epoch from 2021 instead of using Unix time
160-bit identifiers would be great right now. By the way, 160-bit identifiers in Crockford's base32 would be the same length as 128-bit identifiers in the usual UUID string format. Therefore they are compatible.
Is the spreadsheet correct? The "when every nanosecond new UUID is generated" part would, in the case of ns precision, mean that the time is incremented every nanosecond, and because of that the probability of collision is theoretically 0.
...
The way I read RFC4122, it discourages fields that are a non-integral number of octets in length. (I.e. field bit lengths should be a multiple of 4 bits. So 32, 36, 40 are okay. 34 and 38 are not.) From the RFC:
I know this is just expository text in the RFC, but I believe it's well-founded guidance. Sub-octet field definitions make it difficult for humans to interpret a UUID. Witness how the hex digit containing the
You are right. Initially, I assumed that …

Also added a second sheet with graphs; they clearly show that the collision probability mainly depends on the bit count and on the epoch length, not on the timestamp/random split.
In https://github.com/uuid6/uuid6-ietf-draft/blob/master/LATEST.md, I'm proposing that we use an unsigned 64-bit nanoseconds-since-the-Unix-epoch timestamp (and move the version field out of the way, so it's exactly the first 8 bytes). The max timestamp value is 2554-07-21T23:34:33.999999999Z. The rationale is:
Thoughts?
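The quoted max value checks out (my verification sketch): 2^64 − 1 nanoseconds past the epoch lands on 2554-07-21.

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
max_ns = 2**64 - 1
# timedelta has microsecond resolution, so truncate ns -> us.
print(epoch + timedelta(microseconds=max_ns // 1000))
# -> 2554-07-21 23:34:33.709551+00:00
```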
Draft 03 Timestamp Granularity now features a one-stop-shop section of all the lessons learned from our threads on timestamp granularity.
The spreadsheet has been updated again to match the final version. In this category, the situation is clearly worse: the annual collision probability at 1,000,000,000 UUIDs per second is more than 80% (or ≈0.17% at 1,000,000 UUIDs per second).
@edo1 Something seems wrong here. Why would the probability of collision increase over time for timestamp-based UUIDs? For such UUIDs, collisions can only occur for ids with the same timestamp. I.e. collisions are scoped to a specific timestamp interval, thus it doesn't really matter how many intervals there are. The only factor should be the UUID creation rate, since that's what affects how many UUIDs are generated per interval. For example, with "UUID v7 final", creating 10^9 UUIDs / second corresponds to 10^6 UUIDs / timestamp interval. Doing the birthday problem math for that (10^6 random 73-bit values) I get a collision probability of 5.3 x 10^-11. I.e. I think "P collisions/year" should be 0.0000000053%, not 82%.

Edit to add: This is one reason why comparing v4 and v1 (or, now, v7) is always a bit of an apples-to-oranges comparison: v4 depends on the total # of ids produced, v1 depends on creation rate. (Generally speaking, the former is more collision resistant until you've generated some very large number of ids.)
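The per-interval figure above is just the birthday approximation p ≈ n(n − 1)/2d (my restatement):

```python
# 10**6 random 73-bit values drawn within one 1 ms interval.
n, d = 10**6, 2**73
print(n * (n - 1) / (2 * d))  # ~5.3e-11 per interval
```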
Correct
No. This is the probability that a collision will occur per millisecond. Now for a year:

The chance of not having a single collision per year is less than 20%.
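Filling in the arithmetic (my sketch): a year is about 3.16 × 10^10 milliseconds, so the chance of surviving every millisecond collision-free is:

```python
p_ms = 5.3e-11                       # per-millisecond collision probability
ms_per_year = 365.25 * 86400 * 1000  # ~3.16e10
print((1 - p_ms) ** ms_per_year)     # ~0.19, i.e. <20% chance of no collision
```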
The collision chance for the random part is indeed high, but since the timestamp is part of the ID, there will be no ID collision, right?
@martinheidegger there will be no collisions of UUIDs with different timestamps, you are right.
Implementing UUIDv7 properly, only UUIDs created within the same ms / ns will have the same timestamp?! Why try to calculate it for the timeframe of a year?
Ah, right. Birthday problem logic still applies based on the number of time intervals. 'Forgot about that. (This isn't the first time I've made this mistake. 😝)

@martinheidegger: I believe @edo1 is correct here. Timestamps do prevent collisions across time intervals, but there's a non-zero chance the random portion of ids in the same interval will collide. Add up the odds of all the [astronomic number of] permutations where one or more intervals might contain a collision and you get the …

All: So what's the consensus opinion on the fact that v7 has poorer collision resistance than v1 or other versions? Worth worrying about? We could shorten the timestamp field to 44 bits, as discussed previously. That brings the probability down to ~10%. But... does that 8x improvement from 80% -> 10% really matter? We're dealing with such extreme use cases here that improvements that are less than an order of magnitude in nature probably won't move the needle much in terms of how we do / don't satisfy user requirements.
If there is only one UUID generator per database table, then the combination of the timestamp with the counter guarantees 100% that collisions will never occur. It's like autoincrement. We don't even need a pseudo-random segment. If there are multiple generators, then adding the generator ID to the right of the UUID will give the same 100% effect. This is allowed by clause 6.9 of the published Internet-Draft. All calculations above ignore the number of UUID generators per database table. Therefore, all these calculations are wrong. The statement that v7 has poorer collision resistance than v1 or other versions is false. The more UUID generators per database table, the higher the chance of a collision. With a small number of UUID generators, counters are more effective at preventing collisions than pseudo-random segments. Counters act like timestamps.
Do you mean 6.3?
Or do you mean increasing the UUID length with additional fields? Yes, it can solve this problem. But this breaks the idea of compatibility with RFC4122, which is already implemented in DBMS, e.g. PostgreSQL uuid, MS SQL uniqueidentifier.
No. I mean 6.9: "DBMS vendors are encouraged to provide functionality to generate and store UUID formats defined by this specification for use as identifiers or left parts of identifiers...". It's the recommended UUIDv7
I mean the extra segments (additional random segment + metadata) to the right of the UUID in the long concatenated identifier.
Not always. It depends on the circumstances. The database developer can allocate a field in the database table for the long concatenated identifier, or he/she can break the long identifier into UUID and metadata. Compatibility is not always good. I previously wanted a 160-bit UUID with metadata inside, but a concatenated identifier is an even more flexible technical solution. UUID is a standardized feature of the UUID generator (of the DBMS), while a concatenated identifier is a custom feature of a particular database table. We should not adapt to the limitations of the current DBMS. On the contrary, DBMS developers will have to adapt to new UUID versions.
"1,000,000,000 UUIDs per second" collision testI would assume the crux of the issue is UUIDv7 features MS resolution while UUIDv1 has 100-NS. With this we are paused on the timestamp for longer while the rest of the UUID entropy is doing it's thing. That is, UUIDv7's timestamp bit does not advance as fast as UUIDv1 which means more entropy rolls per timestamp bit tick. Thus more rolls = more collision probability.
What does this look like if we say: "generate 1 million UUIDs per timestamp tick", so that actual real-world time does not matter and we are comparing entropy characteristics only? This would be a fair comparison, since both timestamps are paused for 1 million random entropy generations. Run that test on both the same N number of times to keep it fair. My theory is that this has to favor UUIDv7, because I have a hard time wrapping my head around how a layout with more bits dedicated to entropy (UUIDv7's 48-bit timestamp) comes out with worse collision characteristics than one with fewer entropy bits (UUIDv1's 60-bit timestamp).

Thinking about the 44 vs 48 bits @broofa discussed: it is true, we are talking about the extremes. I don't see why we would further reduce the timestamp in favor of more entropy; that counteracts the point above, since we would then pause on each tick longer, increasing entropy rolls and thus the chances of a collision. Could the other method prove better? E.g. take the Unix timestamp to 64-bit nanoseconds and reduce the entropy bits? This means the timestamp will advance faster and we roll entropy less on each tick.

Now, that all being said, remember that we should be using an embedded counter from Section 6.2 for same-timestamp batch UUIDv7 creation, and this is one of the perfect scenarios to illustrate its usefulness. Here each counter tick within the frozen timestamp ensures we have a unique bit-space for the UUID entropy generation (with bonuses on storability as a secondary characteristic). Let's run through a scenario. In this scenario implementers should select an increment method from Draft 03:
Then select an increment type. (To keep things easy, let's say the plus-one increment type.)

Embedded counter using N-1 rollover guard and plus-one increment:

First UUID
Second UUID
...Repeat for the next set of UUIDs up to the millionth, or until the timestamp moves forward, incrementing the rollover counter where required.

Alternative method for random-as-counter with plus-one increment type:

First UUID
Second UUID
...Repeat for the next set of UUIDs up to the millionth, or until the timestamp moves forward.
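A rough sketch of the embedded-counter scenario (my code, not Draft 03 normative text; it uses the 48-bit ms timestamp / 12-bit rand_a / 62-bit rand_b split that UUIDv7 eventually settled on, whose counter is far smaller than the million-per-tick batch discussed here, so treat the field sizes as illustrative):

```python
import secrets
import time
import uuid

def uuid7_batch(n):
    ts_ms = time.time_ns() // 1_000_000
    # N-1 rollover guard: seed only 11 of the 12 counter bits, so at
    # least 2**11 guaranteed increments fit before overflow.
    counter = secrets.randbits(11)
    out = []
    for _ in range(n):
        if counter >= 2**12:
            raise OverflowError("counter rolled over; wait for next ms tick")
        value = (ts_ms & (2**48 - 1)) << 80   # 48-bit unix_ts_ms
        value |= 0x7 << 76                    # 4-bit version (7)
        value |= counter << 64                # 12-bit rand_a as the counter
        value |= 0x2 << 62                    # 2-bit variant (0b10)
        value |= secrets.randbits(62)         # 62-bit rand_b, fresh each time
        out.append(uuid.UUID(int=value))
        counter += 1                          # plus-one increment type
    return out

print(uuid7_batch(3))
```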
OMG! There was a foolish error in the formula (the wrong row number was substituted during copy-paste). Only ≈3 UUIDs per millisecond can be considered safe (less than 1:1,000,000,000 collision chance in 50 years).
Thanks @edo1. One other problem in the spreadsheet: UUIDv7 in Draft 03 has 74 bits of entropy (

That being said, I think the overall test is moot, because it ignores the other characteristics of the UUIDv7 draft which solve the collision problems via methods other than increasing/decreasing timestamp or entropy. That is, Section 6.2 solves bulk UUID generation within a single timestamp via embedded counter logic. If we are doing bulk generation as this test implies, then we should be selecting a counter method and increment type. Assuming the counter does not overflow, we should have 0 collisions. Furthermore, Section 6.3 can be layered onto 6.2 and solves the problem of adding N number of nodes. Assuming everybody has a unique ID, our bit-space does not conflict with anybody else within the application context, removing collisions.
I meant the mechanism from this paragraph:
Of course, in the case of the centralized UUID generation, it is easy to avoid collisions.
In many cases, it is impossible/nearly impossible to assign each node a short unique ID. And doesn't this paragraph mean that UUID v7 cannot use an embedded node ID?
@edo1, great discussion. Personally I would love something we could provide to folks that can be used to test the "required number of bits for a fixed-length counter within their real-world application", e.g. "I want to be able to generate 1,000,000 UUIDs per ms timestamp tick", so on paper a 21-bit fixed-length counter along with the N-1 rollover guard should be sufficient; but in reality the application compute can only generate 500,000 per ms tick, resulting in some wasted counter bits that could have been better allocated to entropy. My key points on the counters:
Distributed Node Generation:
@sergeyprokhorenko, Unfortunately my verbiage in this document is highly restricted by RFC 2119, whose descriptions are anything but unambiguous. SHOULD seems to fit the bill for what I am trying to convey, but if there is a reason to convert it to MAY, that is also fine with me. Both are copied from RFC 2119 for reference below:
The other point I am trying to make here is that UUIDv8 is a totally valid use case for edge scenarios like the one this test is conveying. UUIDv7 in Draft 03 works for the majority of use cases. If an application implementer vets v7 and needs to make some changes, there is nothing stopping them from forking v7 into v8, making those changes, and calling it a day. UUIDv8 gives them the opportunity to do what they please while having a standards-compliant UUID.
@kyzer-davis, There is a good example in section 6.9. DBMS and Database Considerations: You can use it for this text: "Implementations that choose to leverage an embedded node id are encouraged to utilize UUIDv8 for more flexible layout. They MAY also choose to concatenate the UUID with the node id (and other elements) on the right into a single identifier". Italics is my addition. |
You are right, of course. Sorry. Anyway, I dislike the idea of an embedded ID as a generic approach; it is complicated and/or error-prone (requires proper ID length selection, correct unique ID generation, ID reuse, and so on). It might be used in some specific cases, of course.
I tried to say it in #60 (comment). IMO there should be 3 main options (golang is used in the examples):
For PostgreSQL it could be …

This will handle almost all cases of UUID generation. Only rarely is a more sophisticated approach required.
Added UUID v7 with 74-bit randomness into the spreadsheet; it's the second least prone to collisions.
Why I prefer UUID v7: UUIDs from different origins are still monotonic/almost monotonic. UUID v8 makes no guarantees at all.
I read most of the conversation; I'm trying to figure out if I could switch from fully-random (16 random bytes) UUIDs to v7, but am I reading it right? Is the safe value as low as 5000 UUIDs per second? Why did they choose to allocate so much space to the timestamp? It looks very irrational, considering that we have already had 8 UUID releases in 20 years, to think that this one will be the ultimate version that will last 8,000 years...
@sofigio, I advise you to look at more advanced options for UUIDv7: x4m/pg_uuid_next#1 (comment). Even without a random part at all (like autoincrement), they guarantee the absence of collisions and a high speed of UUID generation. But they also have a fairly long random part. All the lazy programmers rushed to implement the most primitive UUIDv7 structure.
Question: Too many or too much?
Current Draft-02 implementation:
Alternatives Proposed: