Add nicer syntax for file sizes #912
We would not introduce a special new type for file sizes, when integers suffice. That said, we did talk about adding many different multiplier suffixes for integers and floats back in #427. We considered
Personally, I recoil at the thought of using a comment in place of a well-established suffix with a clear meaning. With all due respect to @pradyunsg, there are real problems with doing this, and comments can be deceptive.
size1a = 10_300_000 # 10.3 MB, idk
size1b = 10_800_333 # 10.3 MB, idk
size1c = 10_547_000 # 10.3 MB, idk
size1z = 20e+06 # 10.3 MB, idk+idc
I'd be open to hearing a new proposal if it could garner a lot more support than the old proposal did. Make a suggestion or a PR, and rally your colleagues to show their support and spread the word. But maximize utility, expressiveness, and simplicity to make your case. |
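To make the point about deceptive comments concrete, here is a minimal Python sketch of how many byte counts "10.3 MB" can plausibly stand for. The helper name `interpretations` and the three labels are assumptions made for illustration; they are not part of TOML or of any proposal in this thread.

```python
def interpretations(value: float) -> dict:
    """Return the byte counts that "<value> MB" could plausibly mean."""
    return {
        "decimal (1000 * 1000)": round(value * 1000 * 1000),
        "binary (1024 * 1024)": round(value * 1024 * 1024),
        "mixed (1024 * 1000)": round(value * 1024 * 1000),
    }


# Print each reading with digit separators, in the same style as the TOML above.
for label, size in interpretations(10.3).items():
    print(f"{label}: {size:_}")
```

The decimal reading gives the value of `size1a` and the binary reading gives the value of `size1b`, so a bare comment cannot tell a reader which one the author had in mind.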
My main issue with this idea is the distinction between MB and MiB. Personally I find the MiB syntax detestable and always treat filesizes as the sensible-in-technical-domains pow2 form. I know I'm not alone here. About the only time the (completely insane) base-10 form is acceptable when talking about file sizes is in the (unfortunately) already-established practice of hardware manufacturers trying to make their shit seem more powerful/bigger than it is (which I'm pretty sure is the root cause of the problem to begin with). Plus, what if someone wanted to write Note that it's only file sizes specifically that suffer from these dumb problems. The more general suffixes proposed in #427 would have avoided all that baggage. |
@marzer wrote:
ABNF uses case-insensitive matching strings by default. We'd pick up all those variants immediately just by specifying |
Ah, that's true. Guess that eliminates half of my qualms above. |
@JakobDev Could you give this issue a follow-up? It's been five months since your suggestion, and it prompted some discussion on that day. But without some additional interest, continued discussion, or a PR with concrete changes proposed, this issue may be closed like the one that came before it. |
I'm still interested in this, but I can't contribute much to the discussion here. |
I would like to add: if you deserialize file sizes in a programming language, what type should they be converted to? Most programming languages do not have a type for file sizes. This brings in a lot of ambiguity because there are many valid interpretations. Do you parse |
It should be parsed to the number of bytes. The functions that get a file's size in programming languages usually return the number of bytes. For example, if you have a website that allows uploading files, you can set the maximum allowed file size:
[Upload]
max-size = 10MB
Pseudo Python code:
if os.path.getsize(path) > config["Upload"]["max-size"]:
    show_error("Your file is too large")
if len(file_bytes) > config["Upload"]["max-size"]:
    show_error("Your file is too large")
If you want to check whether a file is bigger or smaller than the given size, which will be the most common use case, it needs to be a number. It should be a long to support large sizes such as 1TB, if someone needs that.
If this is wanted, you can just use
[Upload]
max-size = "4kb"
I don't know whether any language out there has one, and I don't know why such a type is needed, but in this case, use the native filesize type of the language. |
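For comparison, this is roughly what the string form asks of the application. It is a minimal sketch under stated assumptions: `parse_size` is a hypothetical helper, the multipliers are 1024-based (the thread itself disputes this choice), and the `config` dict stands in for an already-parsed TOML document.

```python
import re

# Assumption: binary (1024-based) multipliers; the thread debates this choice.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}
_SIZE_RE = re.compile(r"\s*(\d+)\s*([a-z]+)\s*", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Convert a string such as "4kb" into a number of bytes."""
    match = _SIZE_RE.fullmatch(text)
    if match is None or match.group(2).lower() not in _UNITS:
        raise ValueError(f"not a recognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


# Stand-in for the table a TOML parser would return for the snippet above.
config = {"Upload": {"max-size": "4kb"}}
max_size = parse_size(config["Upload"]["max-size"])
print(max_size)
```

Every application that accepts the string form has to carry some version of this helper and decide for itself what "kb" means.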
File sizes are integers. There are no fractional bytes out there. Not touching qubits or anything! The strangulation point is whether kilobytes are 1000 bytes or 1024 bytes. Like it or not, it's that ambiguity in storage quantities that ended #427. We haven't allowed The reason we don't allow something like It really pains me that we can't have SI suffixes for integers, but it's those stupid ambiguous file sizes that make our lives difficult! |
New idea for solving the file-size ambiguity: K means 1000, KB means 1024. M means 1000000 (one million), MB means 1048576 (1024*1024). And so on. Rationale: if #427 was ultimately rejected because, as @eksortso says, "nobody knows what a kibibyte is, outside of our rarified circle", then let's change the syntax. A suffix without a trailing B is a kilo/mega/etc as used in scientific notation (powers of 10), whereas a suffix with a trailing B means "kilobyte/megabyte" with the traditional powers-of-2 meaning thereof (1024, 1048576, and so on).
My reasoning is that when people are working with bytes, powers of 1024 are what they expect. Even in technical circles, nobody actually says the word "kibibyte" out loud: we all talk about kilobytes even though we really do mean 1024 bytes, and nobody thinks this is ambiguous. In writing, we might write "kibibytes", but only in technical contexts where precision matters more than clear communication; most of the time if you see someone write "kilobytes" in an article, you expect that it means 1024 bytes unless the author specifically clarifies that it means 1000 bytes. So why not allow KB, MB, etc. to default to the power-of-2 meaning that everyone expects? The K, M, etc. suffixes will remain powers of 10, to allow Further parts of my proposal:
Floating-point values like 2.5MB could be treated in one of five ways:
I'm coming around to what @eksortso suggested here, which is to parse such values as floats, because it's impossible to tell which of ceiling or truncation rounding would be right for any given application. Also, one other way to handle values like 2.5MB would be to parse them as ints if unambiguous (2.5MB), but leave as floats if not a whole int (10.3MB). I believe this would be a very BAD idea, so I didn't include it in the list above, but it's worth at least mentioning if only to immediately reject the idea out of hand. |
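The suffix rule in this proposal can be sketched in a few lines of Python. This is an illustration only, not a reference implementation: `expand` and the two tables are hypothetical names, matching is case-sensitive here for brevity, and fractional values are returned as floats, following the option the comment above leans towards.

```python
# Bare letters are powers of 10; letters followed by B are powers of 1024.
DECIMAL = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
BINARY = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
MULTIPLIERS = {**DECIMAL, **BINARY}


def expand(literal: str):
    """Expand e.g. '1M' or '2.5MB'; whole-number literals stay int."""
    for suffix, factor in MULTIPLIERS.items():
        if literal.endswith(suffix):
            number = literal[: -len(suffix)]
            if number.isdigit():
                return int(number) * factor
            # Anything with a fractional part is kept as a float.
            return float(number) * factor
    raise ValueError(f"no known suffix in {literal!r}")


for literal in ("1M", "1MB", "2.5MB"):
    print(literal, "->", expand(literal))
```

Because a literal ending in `MB` never ends in a bare `M`, the two families cannot be confused with each other in this sketch.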
One more consideration: the E ("exa", 10¹⁸) potentially conflicts with the E used in scientific notation. Adding this would make parsers more difficult to write, as encountering an E at the end of a number now has two possibilities:
Both can be parsed unambiguously, but it makes parsers slightly harder to write. We would also need to add a rule that you cannot use scientific notation and numeric suffixes together. No writing And another consideration: once you get up into the exabyte range, you're approaching the limits of 64-bit integers. TOML says that the values -2^63...2^63-1 (the natural range of signed 64-bit ints) must be accepted. But 8EB is 2^63, so any value of 8EB or greater would be too large to fit into a 64-bit int. My suggestion for handling this issue is the following:
|
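The 64-bit limit mentioned above is easy to check with a short Python sketch. The names `EB` and `fits_int64` are assumptions for illustration, and `EB` uses the power-of-2 meaning from the proposal.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
EB = 1024**6  # "exabyte" in the power-of-2 sense used by the proposal


def fits_int64(value: int) -> bool:
    """Report whether value lies in the range TOML requires parsers to accept."""
    return INT64_MIN <= value <= INT64_MAX


print(fits_int64(7 * EB))  # 7EB is still inside the signed 64-bit range
print(fits_int64(8 * EB))  # 8EB equals 2**63, one past INT64_MAX
```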
I just don't think this should be added. It is not obvious. As was said before, just do something like And another big problem with this is automatically writing a config file from something like a Python dict. How would the writer know that it has to write a number back as something ending with kb? It wouldn't, so you would get the number of bytes. So this would only add something for reading files, but as soon as you write to the file it's gone. Just don't make it more complicated. |
Personally, I would not want binary suffixes for file sizes ( So let me suggest we scale back. What if we imposed the following?
Outside the standard, we would need to encourage designers to use multipliers only in fields where precise values can range by several factors, from single digits to quadrillions. And to use floats, like Is this useful enough to warrant its inclusion? Does it help users to express certain values more elegantly? Or is it still confusing or not obvious, even after the rules are laid down? |
@tintin10q wrote:
That's a problem to be solved by the emitters, not by the standard. And it already exists; would we write a thousand like |
IIUC, the current proposal is:
file-size = 1M
is translated to:
{"file-size": 1000000}
and
file-size = 1MB
is translated to:
{"file-size": 1048576} # 1024*1024
I don't see why it isn't OK for an application to instead allow
Quoting from #427 (comment):
As a crafted example, the proposal would have the following be valid:
[observable-universe]
number-of-galaxies = { min-estimate = 100G, max-estimate = 2T }
Those are going to result in "correct" values being serialised, but... I don't like that the format would allow doing things like this -- it isn't clearer. While I agree that it is somewhat common to have a need for file sizes, I don't think it's common enough to justify adding this sort of thing -- and it definitely is not worth adding a dedicated type for it. (thanks @rmunn for flagging that I goofed up on the examples here) |
(retitled to better reflect the underlying request here) |
Not quite. The current proposal is that |
Whoops, indeed. Thanks for flagging that -- I've edited my comment to fix that, since it doesn't change the fundamental argument I'm making in it. 😅 |
The fact that the problem already exists doesn't make it ok to make it worse.
I really agree with this. Do not make it more complicated. Normal numbers work just fine. Or just put the unit in the key name: file-size-mb = 1 |
Absolutely no one wants another type for file sizes. Among the proposals discussed, we are no doubt sticking with integers, even if we use floats to get to them. |
Alright, let's close it then. |
Well, for all the times I've requested feedback on this feature, nobody's wanted to speak up for it. Maybe KB's, MB's and GB's are useful and worth supporting across the sphere of configuration space. But if nobody's enthusiastic enough to raise their voice for it, there's no point in dwelling on it any longer. |
I want to speak up for this feature. I think the following "solution" is absolutely awful:
From the point of view of the TOML consumer, the application using this as a config file, this is terrible DX. Instead of just looking for one key called All of which is absolutely avoidable by simply having a single key file-size = "1 mb" Which then delegates responsibility for parsing the string to the TOML consumer. But if I'm using a config-file parsing library, it's because I didn't want to have to write code to parse strings. So TOML is failing to make life easy for me (again, speaking as the hypothetical app's developer) here. This is why the file-size suffix idea keeps coming back. Because nothing else in TOML currently can quite replace it. |
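A rough Python sketch of the consumer-side burden described here, assuming the application accepts a family of unit-in-the-key names. Only `file-size-mb` appears in this thread; the other key names, the 1024-based multipliers, and the helper `read_file_size` are hypothetical.

```python
# Hypothetical family of keys an application might have to accept.
_KEY_MULTIPLIERS = {
    "file-size": 1,
    "file-size-kb": 1024,
    "file-size-mb": 1024**2,
    "file-size-gb": 1024**3,
}


def read_file_size(table: dict) -> int:
    """Return the configured size in bytes, insisting on exactly one key."""
    present = [key for key in _KEY_MULTIPLIERS if key in table]
    if len(present) != 1:
        raise ValueError(f"expected exactly one file-size key, found {present}")
    key = present[0]
    return table[key] * _KEY_MULTIPLIERS[key]


print(read_file_size({"file-size-mb": 1}))
```

The lookup, the conflict check, and the multiplication all live in application code, which is the cost this comment objects to.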
Your terrible DX comes from allowing multiple file-size inputs (kb, mb, gb, etc.) in the config file. Just only allow |
Pretty much all of my remarks about durations also apply to sizes: #514 (comment) The proposal I mentioned in that comment also implemented size units (considering both are essentially the same, "a number with a suffix", I felt it was useful to consider both at the same time). An additional issue for sizes is that there is rarely an obvious type to parse things into. "Just" parsing to an int of bytes won't work, and most stdlibs don't have anything for this, so you really do have to implement your own |
Let me reiterate that we want to be as obvious to our users as possible. Implementers will have more work to do in order to make something that users can pick up quickly. In short, we should always favor Your analysis on #514 was very helpful, though as far as bit and byte sizes are concerned, we're talking about a situation where users and developers are the same people, and you cite popular systems where kb/mb/gb suffixes are supported, even with fractional (float) values using the suffixes. And in such instances, the end type ultimately derived will be an integer and nothing else. That's my interpretation, anyway, and we've discussed implementation details at considerable length already. Are these popular enough that we ought to fold this syntax into the TOML standard for our technical users? Is it worth it, to them and to all of us, for all TOML users to have access to this easy-to-read shorthand? |
If it's only an int with no additional information, then as a TOML application, how will I distinguish between "user really wants a small cache" vs. "they forgot to add a suffix"? In Python all I see is And "upgrading" something like So you really do need to do something special as applications need to be able to tell if it's a regular number, or a number with a suffix. It was discussed, sure, but I don't think anyone realized the full details. I didn't either until I actually implemented it (this is why people should really write/prototype code instead of discussing implementation details in the abstract). That's not necessarily a show-stopper, but it does make things a tad harder, especially for application authors, and it's all a bit non-obvious at a glance. At the very least the changelog for this should make some notes about this, so implementers are aware of the potential issue before they start, and can communicate it to their users (application authors). |
And "do something special" seems simple, e.g. Python can use:
Or something along these lines, applications can then use
Because the Size subclass of int gets "duck typed" to a regular int, it will appear to "just work" unless the application author thinks of this possibility, which is easy enough to forget. So, maybe the simple subclass isn't sufficient, and you want a |
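The code that originally accompanied this comment is not preserved here. As a stand-in, the following is a minimal sketch of what an int subclass along these lines might look like; the class body, the `suffix` attribute, and the example values are assumptions, not the commenter's actual implementation.

```python
class Size(int):
    """An int that remembers it was written with a size suffix (hypothetical)."""

    def __new__(cls, value: int, suffix: str = ""):
        obj = super().__new__(cls, value)
        obj.suffix = suffix
        return obj


cache = Size(10 * 1024**2, "MB")  # as if parsed from `10MB`
plain = 10                        # as if parsed from a bare `10`

# An application can tell the two apart only if it remembers to check.
print(isinstance(cache, Size))
print(isinstance(plain, Size))

# Arithmetic silently falls back to a plain int, losing the suffix information.
print(type(cache + 1).__name__)
```

The last line shows the duck-typing concern: once the value takes part in ordinary arithmetic, the result is a plain int and the marker is gone.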
My own take is that if file size syntax does not convert to an integer, then we are absolutely making things more complicated than they need to be. We want to be programming language agnostic, so special classes for file sizes are out of the question because they're not obvious. Let's hold off until after the next release, and we can revisit this in a different light. Unless (as I keep asking) we get a lot more use cases in here to persuade us to move on it sooner than v1.1.0. |
I have used a similar thing in two places: In these applications I absolutely love the ability to use tagged integers. Both of these allow for interacting/validating/working with data which is then presented as a static config format in the final step. I think the best-articulated version of my argument is the Cap'n Proto FAQ, where the author explains why "Required" is not available in Cap'n Proto, with a similar argument. It boils down to keeping the layers of the problem separate. I don't like the idea of adding the feature on principle, but it would definitely be a nice-to-have. |
TOML is supposed to be simple without bloat. A specific type for file sizes just doesn't make sense. Integers suffice for file sizes. |
It would be nice to have file sizes in TOML.
Things like a maximum file size are used in many configurations, and it should be easy for the different parsers to implement.