Optionally supported implementation-defined values #707

Closed
yyny opened this issue Feb 17, 2020 · 17 comments

@yyny

yyny commented Feb 17, 2020

There have been several proposals to add new types/syntax for TOML values.

I personally don't like adding more types to the core of TOML, even date/time overdid it IMHO (Makes it harder to write a dependency-free parser in low-level languages that don't have a built-in "Date" type), but I do see the need for custom types.

I think there should be an extension to the TOML spec to allow for implementation-defined value syntax, so that's what I'm proposing here.

Proposal

A value which does not have the syntax of any of the currently existing basic types is considered "implementation defined".
Implementation-defined values are defined as follows:

An implementation defined value...

  • MUST start right after an equal sign following a key
    • MUST NOT be used as table keys or array names
  • MUST contain at least a single character
    • No empty values
  • MUST NOT begin with one of ""'{[" characters
    • These keep their usual meaning
  • SHOULD NOT begin with true, false
  • SHOULD NOT begin with valid date tokens
  • SHOULD NOT begin with valid number tokens
    • Implementations MAY ignore these recommendations, e.g. 10ms and 100% can be very useful
  • MUST have matching delimiters (e.g. ( is an error).
  • MUST NOT allow unmatched closing delimiters (e.g. ) is an error).
  • MUST NOT allow unpaired opening- and closing-delimiters (e.g. ( { ) } is an error).
  • MUST end right before a newline character when not inside unmatched delimiters.
  • MAY fail to parse when encountering a newline inside unmatched delimiters.
  • MUST have greater indentation on every line than the indentation at the start of the key for every line.
  • SHOULD have greater indentation on every line than the indentation of the equal sign.
    • MAY throw an error otherwise
  • MUST parse quoted strings as usual (even inside unmatched delimiters).
    • MUST process escapes (That is, print " " and print "\x20" are exactly the same implementation-defined token).
  • MUST parse # comments and ignore them (but not the newline after the comment).
  • MUST NOT parse digits in any special way.
    • MAY implement implementation-defined values starting with number tokens differently.
      • Example: might throw an error when finding delimiters after numbers.
      • Example: might pass a number instead of a string to registered callbacks.
  • MUST be allowed inside inline tables and arrays.
  • MUST give the usual meaning to commas and the ]} closing delimiters while inside inline tables and inline arrays when not inside unmatched delimiters (e.g. { name = implementation("defined") } is a valid inline table).
  • MUST parse a \ and the character following it separately.
    • Example: May escape \" quote to prevent starting string.
      • Whether or not the \ is removed in this case is IMPLEMENTATION DEFINED
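
The delimiter rules above (matching, no unmatched closers, no interleaved pairs, strings parsed as usual) can be sketched with a stack. This is only an illustration under my own assumptions: the function name `delimiters_balanced` is hypothetical, comments and indentation rules are ignored, and only single-line quoted strings are handled.

```python
# Hypothetical helper, not part of any TOML library.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def delimiters_balanced(value: str) -> bool:
    """Return True if every delimiter in `value` is paired and properly nested."""
    stack = []
    i, n = 0, len(value)
    while i < n:
        ch = value[i]
        if ch == "\\":
            # A backslash and the character after it are taken together.
            i += 2
            continue
        if ch in "\"'":
            # Skip a quoted string; delimiters inside it do not count.
            i += 1
            while i < n and value[i] != ch:
                if ch == '"' and value[i] == "\\":
                    i += 1  # skip the escaped character in a basic string
                i += 1
            if i >= n:
                return False  # unterminated string
            i += 1
            continue
        if ch in PAIRS:
            stack.append(PAIRS[ch])  # remember which closer we expect
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return False  # unmatched or interleaved closer
        i += 1
    return not stack  # leftover openers mean an unmatched "("


print(delimiters_balanced("expr(10 + 20)"))
print(delimiters_balanced("( { ) }"))
```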

Additionally:

  • SHOULD be invalid for at least some valid parse inputs.
    • Implementation-defined values are not a replacement for strings.
  • SHOULD remove comments.
  • SHOULD remove spaces before comments.
  • SHOULD remove lines containing only spaces and comments.
  • SHOULD remove indentation according to least-indented line.
  • SHOULD condense multiple spaces outside of unmatched delimiters into one.
  • SHOULD remove leading and trailing spaces before processing.
    • That is, an implementation-defined value with or without leading/trailing space should produce the same result.
  • SHOULD give the same meaning to the same implementation-defined value regardless of context.
  • MAY provide a registered callback with the raw token as it appears in the TOML file in addition to the preprocessed string/token list
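
As a rough illustration of the SHOULD-level preprocessing listed above, here is a minimal sketch. The function name `preprocess` is an assumption of mine, and the sketch simplifies heavily: it treats every `#` as a comment start (even inside quotes) and condenses spaces everywhere after a line's indentation, without tracking delimiters.

```python
import re
import textwrap


def preprocess(raw: str) -> str:
    """Hypothetical preprocessing of a raw implementation-defined value."""
    kept = []
    for line in raw.splitlines():
        # Remove the comment and any spaces before it.
        code = line.split("#", 1)[0].rstrip()
        # Drop lines that contained only spaces and comments.
        if code.strip():
            kept.append(code)
    # Remove indentation according to the least-indented line.
    dedented = textwrap.dedent("\n".join(kept)).splitlines()
    out = []
    for line in dedented:
        body = line.lstrip(" ")
        indent = line[: len(line) - len(body)]
        # Condense runs of spaces after the remaining indentation.
        out.append(indent + re.sub(r" {2,}", " ", body))
    # Leading and trailing whitespace must not change the result.
    return "\n".join(out).strip()


print(preprocess("  expr(10   + 20)  # same as above"))
```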

The current "Date" syntax would be moved from a "required type" to a "recommended implementation-defined type".
This way, implementations that do not care about date values can simply ignore them and throw an error, without breaking TOML compliance.
Other proposed types can also be added as recommendations.

Implementation

Parser implementations that do not need implementation-defined values can simply throw an error when they encounter one.
Additionally, this means that all old TOML parsers would still be standard-compliant.

Parsing libraries that want to allow for implementation-defined values should allow the parser user to register a callback that gets called whenever an implementation-defined value token is recognized. They should get a string (or list of tokens) containing the full implementation-defined token that was parsed, with certain pre-processing as outlined above (e.g. removing leading and trailing spaces). These callbacks can then either return nothing/return an error to signal they do not recognize the string, or an implementation-defined value.

Parsing libraries may restrict implementation-defined values, for example, only allow values that look like function syntax: expr(1 + 2) # This may search for a callback registered as "expr"
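
A callback registry restricted to function-like syntax might look roughly like the following. Everything here (`ValueRegistry`, `register`, `resolve`) is a hypothetical API invented for illustration, not something an existing TOML library provides; it assumes the parser has already isolated the value token.

```python
import re
from typing import Callable, Dict

Handler = Callable[[str], object]

# Only values shaped like name(arguments) are accepted in this sketch.
_CALL = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\((.*)\)", re.DOTALL)


class ValueRegistry:
    """Hypothetical registry mapping callback names to handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        self._handlers[name] = handler

    def resolve(self, token: str) -> object:
        match = _CALL.fullmatch(token.strip())
        if match is None or match.group(1) not in self._handlers:
            # No callback recognizes the token: treat the document as invalid.
            raise ValueError(f"unrecognized implementation-defined value: {token!r}")
        return self._handlers[match.group(1)](match.group(2))


registry = ValueRegistry()
registry.register("range", lambda args: range(*(int(a) for a in args.split(","))))

print(list(registry.resolve("range(0, 10, 2)")))
```

An application that registers nothing rejects every such value, which matches the opt-in behaviour described above.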

Multi-line values, indentation counting and preprocessing are really nice features, but implementations that do not wish to implement them can ignore them.
Additionally, implementations may wish to preprocess implementation-defined values however they see fit (e.g. remove leading space or ignore spaces completely), including no preprocessing at all.

Syntax Highlighting

The biggest drawback to this proposal is that a lot of syntax highlighters do not yet support this syntax.
Some syntax highlighters are not advanced enough to support some of the syntax proposed here at all (for example, counting leading indentation).
In addition, many current syntax highlighters clearly mark unrecognized tokens as invalid.
This nice feature would no longer be possible in a mix of implementation-defined and implementation-free TOML.
I don't consider this a big problem, since the syntax is implementation-defined by definition, and syntax highlighting should (mostly) be a concern for the implementation, not for the TOML spec.

Examples

expression = 10 + 20
expression-also = compute '10 + 20'
function-call = expr(10 + 20)
also-function-call = expr(10
           + 20) # same as above
range-value = range(0, 10, 2)
time = 10ms
enum-value = yes
tuple = (1, 2, 3)
allowed-but-discouraged =
    Unquoted string syntax... sort of
    We have to trust that leading spaces are filtered out by implementation...
    Additionally, "quotes" are still recognized.
allowed-but-discouraged = HashMap
    { key1 = "a"
    , key2 = "b"
    } # indentation lower than equal sign, but greater than key
inside-inline = [ 10ms, { value = 1 + 1 } ]
# not-allowed = [1, 2, 3] teehe not really an array
# not-allowed = # empty value
# not-allowed = :) # unmatched delimiter
# not allowed = function_call(
# ) # invalid indentation

List of proposals that propose new types/value syntax:

Glossary


Delimiter

One of "[](){}"

Paired delimiters

One of:

  • "[]"
  • "{}"
  • "()"

Opening Delimiter

One of "[{("

Closing Delimiter

One of "]})"

Matching Delimiter

An opening delimiter followed at some point by the closing delimiter in the same pair, with matching delimiters inside

Key

Standard TOML token

key = value
^ The key token starts here

Equal sign

Standard TOML token

key = value
    ^ The equal sign is here, and has an indentation of "4"

Registered callback

A function that gets called by the implementation when an implementation-defined value has been parsed

@jaskij

jaskij commented Nov 26, 2020

Seconded. As an embedded developer I see value in using TOML over other languages (well-defined spec and end-users familiar with INI syntax), but creating a compliant implementation while targeting tiny microcontrollers without even a full C standard library would likely be a no-go. At best the implementation would be fully compliant while allowing the user to opt out of some parts of the spec.

The devices I have in mind would likely read that file from a pendrive or an SD card and likely have no graphical interface either. Note that both C and C++ have very relaxed requirements when it comes to these tiny devices. Even malloc() doesn't have to be present, and sprintf() and sscanf() also often rely on calls which must be implemented by the user.

@pradyunsg
Member

@jaskij what's the blocker for your use case? If you want to use a well-defined subset, there's nothing stopping you from doing so. It won't be implementing the TOML spec in its entirety but that's fine given the constraints of the environment.

This issue is asking for more flexibility than TOML provides today. Are you asking for more flexibility somehow?

@jaskij

jaskij commented Nov 26, 2020

@pradyunsg I might have worded it badly - for embedded use cases, a strong requirement for datetimes and time of day, which are often not used in those systems, is, if not a blocker, then at least undesired. I'm thinking of devices which have something like 256 kiB total - every kilobyte is precious. I wouldn't be surprised if users did not want to include date handling code. Making them optional would allow such an implementation to be compliant without undue burden. If I know from the outset that I cannot make my implementation compliant, why even start writing it?

And yes, personally, I put great emphasis on allowing an implementation to call itself compliant.

In a different use case (this time on Linux) I do use time spans, which this proposal would make implementation-defined. Embedded systems use time spans regularly.

I might have been a little misguided earlier, but yes, I do need more flexibility - at least in making timezones and datetimes optional.

@arp242
Contributor

arp242 commented May 18, 2022

You can put things in strings:

function-call = "expr(10 + 20)"
tuple         = "(1, 2)"

And then the implementation can process that however it likes.
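
For instance, an application could interpret those strings after parsing, roughly as below. This is a sketch under assumptions: `config` stands in for what a TOML parser would return, `eval_arith` and `eval_call` are hypothetical helpers, and only integer `+`, `-` and `*` are supported.

```python
import ast
import operator
import re

# Stand-in for the dictionary a TOML parser would return.
config = {"function-call": "expr(10 + 20)", "tuple": "(1, 2)"}

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def eval_arith(src: str) -> int:
    """Evaluate integer arithmetic without calling eval()."""
    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(src.strip(), mode="eval").body)


def eval_call(value: str) -> int:
    """Accept only strings of the form expr(...)."""
    match = re.fullmatch(r"expr\((.*)\)", value.strip())
    if match is None:
        raise ValueError(f"not an expr(...) string: {value!r}")
    return eval_arith(match.group(1))


print(eval_call(config["function-call"]))
print(ast.literal_eval(config["tuple"]))
```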

range-value = range(0, 10, 2) and tuple = (1, 2) would of course look nicer, but we'd have to modify the specification to say "almost anything is allowed after a =", and this makes it really hard to tell if something is a valid TOML document or not, and "valid" will mean "valid according to this specific interpreter".

At the very least, it would need a different syntax than key = value, for example "start with !"

function-call = !expr(10 + 20)

Or maybe using a different key/value separator such as :=:

tuple := (1, 2)

But then you still have the problem "what kind of value is this?" The only way you know that the tuple key is a tuple is if your implementation either knows about the (..) syntax, or if you told the implementation "parse the tuple key as a tuple". The same applies to any other value.

In YAML this is fixed by telling it what object type to use:

!!python/object/apply:os.system
args: ['ls /']

The above only shows a problem with this: allowing to use any object is also a security risk; parsing random TOML documents is no longer safe (just as parsing random YAML documents isn't, unless you disable this feature which things like yaml.safe_load does in Python).


Overall, I'm not in favour of this. I would be in favour of adding durations to the specification, but that's a different issue. I might be in favour of adding some functions to the specification (such as expr() and range()), although I'm not sure about that and I'm not in favour of just adding implementation-defined free-form functions.


btw, what if someone defined a TOML document like this:

a = expr(999999999999999999999999 ^ 99999999999999999999999999999)
b = expr(999999999999999999999999 ^ 99999999999999999999999999999)
c = expr(999999999999999999999999 ^ 99999999999999999999999999999)
d = expr(999999999999999999999999 ^ 99999999999999999999999999999)

That's basically a DoS attack since computing the power of large values is computationally quite expensive.
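
An application that did choose to evaluate such expressions could bound the cost before computing anything. A minimal sketch, assuming `^` is meant as exponentiation (in Python that is `**`); `bounded_pow` and `MAX_BITS` are hypothetical names, and the check is deliberately conservative.

```python
MAX_BITS = 4096  # arbitrary illustrative limit on the size of the result


def bounded_pow(base: int, exponent: int, max_bits: int = MAX_BITS) -> int:
    """Compute base ** exponent only if the result is guaranteed to be small."""
    if exponent < 0:
        raise ValueError("negative exponents are not supported")
    # abs(base) < 2 ** bit_length, so the result has at most
    # bit_length * exponent bits; reject anything above the limit.
    if base.bit_length() * exponent > max_bits:
        raise OverflowError("result would exceed the configured size limit")
    return base ** exponent


print(bounded_pow(2, 10))
```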

@yyny
Author

yyny commented May 18, 2022

You can put things in strings [...] And then the implementation can process that however it likes.

That seems like a very bad idea. Parser implementations should always treat strings as plain text, users and applications should be able to distinguish strings from implementation-defined values.

and this makes it really hard to tell if something is a valid TOML document or not, and "valid" will mean "valid according to this specific interpreter".

TOML documents with implementation-defined values could have a separate .xtoml extension, and I think that is a good idea.

But then you still have the problem "what kind of value is this?" The only way you know that the tuple key is a tuple is if your implementation either knows about the (..) syntax, or if you told the implementation "parse the tuple key as a tuple". The same applies to any other value.

That is the point of implementation-defined values. If the TOML implementation does not support them, or does not have them enabled, it should just skip over them or treat the TOML document as invalid. TOML is not a data exchange format; it's mainly intended for configuration.

The purpose of the proposal is to provide a forward- and backward-compatible way to satisfy the common need for configuration files to have minimal, readable syntax for application-specific values, while continuing to co-operate with existing and future TOML infrastructure, such as TOML validators, TOML formatters, TOML parsers, TOML writers, TOML editor plugins, etc.

The above only shows a problem with this: allowing to use any object is also a security risk; parsing random TOML documents is no longer safe (just as parsing random YAML documents isn't, unless you disable this feature which things like yaml.safe_load does in Python).
btw, what if someone defined a TOML document like this: [...] That's basically a DoS attack since computing the power of large values is computationally quite expensive.

The TOML implementation does not do anything with the value; it simply treats it as a list of tokens and lets the application choose what to do with them. I hope the proposal made it sufficiently clear that the feature is very much intended to be opt-in. If an application decides to define or enable syntax that runs arbitrary computations like this, then that is on them, but at that point they probably have a compelling reason to do so. The same applies to many other uses of implementation-defined values, such as including files, inheriting parts of the TOML document recursively, substituting environment variables and strings, etc. These are all features which quite obviously don't belong in the TOML spec and should not be enabled by default, but which nonetheless are very valuable for many applications of TOML.

@arp242
Contributor

arp242 commented May 18, 2022

You can put things in strings [...] And then the implementation can process that however it likes.

That seems like a very bad idea. Parser implementations should always treat strings as plain text, users and applications should be able to distinguish strings from implementation-defined values.

Anything can have a whole lot of different semantic meanings; "plain text" is often not "just plain text" (or "just a number"); for example:

filter-regexp = "[a-zA-Z]+"

error-email = "[email protected]"

download-dir = "~/Downloads"

history-file = "~/.myapp/history"

new-file-permission = 755

complaint-surcharge = 100

These are all different "types" (regexp, email, directory, file, unix permission bits, monetary unit).

I don't really see the difference between treating a string with the semantic meaning of an email address or directory as any different from treating it with the semantic meaning of an expression or something else.
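
To make that concrete, here is a sketch of an application giving a few of those values their semantic types after parsing. `config` stands in for the parser's output, and reading 755 as octal digits is my assumption about what the permission example intends.

```python
import os.path
import re
from pathlib import Path

# Stand-in for the dictionary a TOML parser would return.
config = {
    "filter-regexp": "[a-zA-Z]+",
    "download-dir": "~/Downloads",
    "new-file-permission": 755,
}

# The parser only sees a string or an integer; the application adds meaning.
filter_regexp = re.compile(config["filter-regexp"])
download_dir = Path(os.path.expanduser(config["download-dir"]))
# Assumption: the decimal digits 755 are meant as Unix permission bits.
permission = int(str(config["new-file-permission"]), 8)

print(filter_regexp.fullmatch("abc") is not None)
print(oct(permission))
```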

@eksortso
Contributor

With all due respect to @yyny, I believe that any considerations of validation, syntax highlighting, security, or even practicality of implementation-specific modifications that fall outside of the TOML standard should be handled solely by the implementing project alone.

We should not open ourselves up to these things. Either a TOML implementation is compliant with a given version of TOML, or it is not. Full stop.

If they are not compliant, they should describe how they are "nearly compliant" or "almost compliant", but the burden falls on them to describe their variations. They can do so in their READMEs, and many specific implementations already exist with such caveats. You can review different implementations via the wiki.

Parser writers have a lot of flexibility to make what suits their needs, and they could even come up with valuable extensions that may ultimately find their way into the TOML standard. But compliance is up to us to define. Non-standard variations are not. They are their creators' burdens first and foremost.

No changes need to be made to the standard to express this sentiment.

@CheaterCodes

CheaterCodes commented Oct 12, 2022

I've been busy on and off during the last months writing TOML parsers using different approaches.
While this isn't my biggest annoyance with the implementation, I do feel the overhead of the native DateTime support.
I have never actually seen it being used in the wild and it is fairly annoying to parse.
I am afraid that future TOML versions are dropping the M, but I understand that custom types can be very useful.

As such, I would like to propose prefixed strings. They are essentially strings prefixed with an unquoted key, e.g.:

time = t"10:32:00.555"
offset = t"1987-07-05 17:45:56+13:00"
regex = r"[a-zA-Z]+"

For existing fully compliant parsers, supporting this should be trivial.
Smaller parsers can be compliant in the sense that they accept files with this syntax, but they don't need to actually parse the value.
This syntax can easily be reused for other custom types like regex in the example above.

Note that this is different to just using plain strings, since it gives semantic meaning to the string contents which is useful for e.g. syntax highlighting.
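
A parser that accepts this syntax without interpreting it could simply split the prefix from the string body, as in the sketch below. `PrefixedString` and `parse_prefixed` are hypothetical names; escapes inside the string are recognized but left unprocessed, and only single-line basic strings are handled.

```python
import re
from typing import NamedTuple


class PrefixedString(NamedTuple):
    prefix: str  # the unquoted key in front of the string, e.g. "t" or "r"
    text: str    # the raw string body, escapes left untouched


# An unquoted key immediately followed by a double-quoted string.
_PREFIXED = re.compile(r'([A-Za-z0-9_-]+)"((?:[^"\\]|\\.)*)"')


def parse_prefixed(value: str) -> PrefixedString:
    match = _PREFIXED.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a prefixed string: {value!r}")
    return PrefixedString(match.group(1), match.group(2))


print(parse_prefixed('t"10:32:00.555"'))
```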

Edit: Just noticed #603 which is related I suppose.

@eksortso
Contributor

eksortso commented Oct 13, 2022

@CheaterCodes Strings are a cheap fallback for many implementation problems. The semantics of your strings ultimately depend upon the application though. Tags like those in #603 could prove useful for microformats or syntax highlighting someday. More work needs done on this front though. Since you're writing your own parsers, try tags for yourself like e.g. (tel) for phone numbers, and let us know how it works out.

I don't think something like a t"" syntax would be very popular. But your issue with the DateTime data types you're using is not a problem for the TOML standard to solve. That's a problem that good parsers can solve with a better choice of data types. Parsers don't have to use rarely-touched classes, especially when simpler data types would serve users better. And with solid work in place, you can promote simpler approaches with DateTimes instead of falling back on strings and parsing things twice.

As for regexes, just use single-quote literal strings. That's what they're for. I know that you picked r"" (raw string syntax) from Python for your example, but TOML already has literal quotes for the same purpose. I've used both ' and ''' in my own projects, and their use has been almost exclusively for complicated regexes in my work. Again, YMMV. Tags probably won't help you here, since there are at least a dozen different regex dialects in the wild, and the dialect that you support for a hypothetical (regex) tag will be implementation-specific.

# TOML v1.0.0
time = 10:32:00.555
offset = 1987-07-05 17:45:56+13:00
regex = '[a-zA-Z]+'

@arp242
Contributor

arp242 commented Oct 13, 2022

The problem was never with syntax anyway, but with the semantics of it all. What if an implementation doesn't support regular expressions? What should regex = r"[a-zA-Z]+" do then? How do we even know that r".." denotes a regexp? Plus all the other caveats and problems mentioned in this thread.

The syntax is unimportant until the semantics can be figured out.

@CheaterCodes

I guess my explanation just completely missed the point, my bad.

So I agree, tags work just fine, syntax doesn't matter.
But my problem with TOML is that there's too much syntax. There are currently 4 different variants of strings, 5 if you include date-time (or 8 if you count all variants). This is, in my opinion, an unnecessary burden on parsers with little benefit.

As an alternative, consider tagged strings (with adjusted rules):
A string is a quote followed by any number of characters (other than quotes) and delimited by another quote. Newlines may be escaped using a backslash, quotes may be escaped as two consecutive quotes.

This provides the (simple) rules for a parser to parse any string, no matter its contents. Tags could then be used to clarify how the content should be handled, e.g. escaped or not, treated as time, etc.

Maybe this is already too hard to read to be suitable for TOML, I might consider drafting my own format then.

@arp242
Contributor

arp242 commented Oct 13, 2022

But my problem with TOML is that there's too much syntax. There are currently 4 different variants of strings, 5 if you include date-time (or 8 if you count all variants). This is, in my opinion, an unnecessary burden on parsers with little benefit.

The way strings work is not my favourite part of TOML either, but that's not going to change no matter what we decide here, because changing it would break compatibility. Also, datetimes aren't really strings; you can express them as dt = 2022-10-13.

@CheaterCodes

because changing it would break compatibility.

Which isn't technically a problem for a potential 2.0 release, but obviously this wouldn't be my choice.

Also, datetimes aren't really strings;

Yes, but it was suggested above that simple parsers don't need to convert it to an actual date-time but can treat it as a string. Even with 4 string types my point stands, but I would still argue that datetimes fall into the same category.

@arp242
Contributor

arp242 commented Oct 13, 2022

There are no plans for any v2 release and I would be surprised if there was ever an incompatible release. Might as well just create a new format because that's effectively the same as releasing an incompatible version of an existing format.

@CheaterCodes

Yes, I understand and somewhat agree.

@eksortso
Contributor

But my problem with TOML is that there's too much syntax.

If you're writing the parser for your own use, then nothing's preventing you from making it work just for yourself. Make a parser that only recognizes one string syntax and treats all date/time values as strings; write all your TOML configs (which would still be valid TOML) to use that format; and don't rely on excluded data types within your application.

Your parser won't be fully TOML-compliant, but you can clearly say how it diverges from the standard. That's better than nothing.

@pradyunsg
Member

  • MUST have matching delimiters (e.g. ( is an error).
  • MUST NOT allow unmatched closing delimiters (e.g. ) is an error).
  • MUST NOT allow unpaired opening- and closing-delimiters (e.g. ( { ) } is an error).
  • MUST end right before a newline character when not inside unmatched delimiters.
  • MAY fail to parse when encountering a newline inside unmatched delimiters.
  • MUST have greater indentation on every line than the indentation at the start of the key for every line.
  • SHOULD have greater indentation on every line than the indentation of the equal sign.
    • MAY throw an error otherwise
  • MUST parse quoted strings as usual (even inside unmatched delimiters).
    • MUST process escapes (That is, print " " and print "\x20" are exactly the same implementation-defined token).
  • MUST parse # comments and ignore them (but not the newline after the comment).

With all these restrictions, this is strictly less useful than strings handled in a special manner would be. I'd say that expressing rich information of this form as strings is what we want people to do.

Putting the description provided into the sort of language that TOML uses in its specification (and describing these rules in the corresponding ABNF) would likely rival the length and complexity of most of the language. I don't think the restricted (conditional) flexibility afforded by this is worth that complexity.


TOML documents with implementation-defined values could have a separate .xtoml extension, and I think that is a good idea.

I don't -- at that point, it's effectively admitting that this is a new language with different considerations and guarantees. Nothing stops people from defining their own supersets, and folks are welcome to do so already.

Parser implementations that do not need implementation-defined values can simply throw an error when they encounter one.

This is another indicator that what you're really describing is an optional superset of TOML, and that this isn't strictly needed everywhere. I don't think having conditionally parsable TOML files is a good idea.


The TOML implementation does not do anything with the value, it simply treats it as a list of tokens and lets the applications choose what to do with them.

This isn't compatible with TOML's design objective of:

be easy to parse into data structures in a wide variety of languages.


I'm going to say that this sort of conditionally-supported undefined behaviour isn't going to be added to TOML, and close this issue.

Thanks for making this proposal @yyny -- it's certainly intriguing and, while it doesn't seem like a good fit for TOML, it is appreciated that you put in the time to write this down in the amount of detail that you did. :)

@pradyunsg changed the title from "Proposal: Implementation-defined values" to "Optionally supported implementation-defined values" on Aug 28, 2023
7 participants
@arp242 @eksortso @jaskij @pradyunsg @yyny @CheaterCodes and others