Clean up the use of double angular brackets
RaimoNiskanen committed Nov 17, 2023
1 parent 025f666 commit 43f96cb
Showing 1 changed file with 63 additions and 65 deletions.
128 changes: 63 additions & 65 deletions eeps/eep-0066.md
@@ -45,8 +45,8 @@ Design Decisions
----------------

In the following text double angle quotation marks are used to
mark source code characters in a paragraph. For example:
«`.`» means the dot character (full stop).
mark source code characters to improve clarity.
For example: the dot character (full stop): «`.`».

### Erlang Language Structure (Tokenizer and Parser)

@@ -76,9 +76,9 @@ much state and looks just a few fixed number of characters ahead
in the input.

For example; from the start state, if the tokenizer sees
a «`'`» character, it switches state to scanning a quoted atom.
While doing so it translates escape sequences such as «`\n`»
(into ASCII 10) and when it sees a «`'`» character it produces
a `'` character, it switches state to scanning a quoted atom.
While doing so it translates escape sequences such as `\n`
(into ASCII 10) and when it sees a `'` character it produces
an atom token and goes back to the start state.

### Problems with simple prefixes
@@ -92,7 +92,7 @@ The tokenizer would have to know of all combinations of prefix characters
and emit distinct tokens for every combination.

Today, the character sequence «`b`», «`f`», «`"`» is scanned as a token
for the atom «`bf`» followed by the string start token «`"`».
for the atom `bf` followed by the string start token `"`.
That combination fails in the parser so it is syntactically invalid today,
which is what makes simple prefixes a possible language extension.
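
As a rough illustration of the claim above, the following sketch feeds such a character sequence to the standard `erl_scan` and `erl_parse` modules. The module name `prefix_scan_demo` and its two functions are hypothetical helpers, not part of the EEP or of OTP. Note that `erl_scan` emits one token for the whole string literal rather than a separate string start token.

```erlang
-module(prefix_scan_demo).
-export([scan/1, parses/1]).

%% Tokenize a source fragment with the standard scanner.
scan(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    Tokens.

%% Return true if the fragment, terminated with a dot,
%% is accepted by the parser as a sequence of expressions.
parses(Source) ->
    case erl_parse:parse_exprs(scan(Source ++ " .")) of
        {ok, _Exprs} -> true;
        {error, _Reason} -> false
    end.
```

With this helper, `prefix_scan_demo:scan("bf\"abc\"")` gives an atom token for `bf` followed by a string token, and `prefix_scan_demo:parses("bf\"abc\"")` returns `false`, since the parser rejects an atom directly followed by a string.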

@@ -107,30 +107,30 @@ Furthermore, it is likely that we want the feature of choosing

re(^"+.*/.*$)

Among the desired delimiters are «`/`» and «`<`»+«`>`». The currently
valid code «`b<X`» meaning atom «`b`» less than «`X`», would instead
have to be interpreted as prefixed string start «`b<`» with «`X`»
Among the desired delimiters are `/` and `< >`. The currently
valid code «`b<X`» meaning atom `b` less than `X`, would instead
have to be interpreted as prefixed string start `b<` with `X`
being the first string content character.

For the «`/`» character we run into similar problems with for example
For the `/` character we run into similar problems with for example
«`b/X`», which would be a run-time error today, but if we also would
want capital letter prefixes, then «`B/X`» is perfectly valid today
but would become a string start.

There are more likely problems with simple string prefixes:
«`#bf{`» is today the start of a record named «`bf`», and is
scanned as punctuation character «`#`», atom «`bf`» and separator «`{`»,
which the parser sorts out to be a record start.
«`#bf{`» is today the start of a record named `bf`, and is
scanned as punctuation character `#`, atom `bf` and separator `{`,
which the parser figures out to be a record start.

With simple prefix characters the tokenizer would have to be rewritten
to recognize «`#bf`» as a new record token, a rewrite that might cause
unexpected changes in record handling. For example, today, «`# bf {`»
is also a valid record start, so to be completely compatible the tokenizer
is also a valid record start, so to be compatible the tokenizer
would have to allow white-space or even newlines within the new record
token, between «`#`» and the atom characters, which would be really ugly...
token, between `#` and the atom characters, which would be really ugly...
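
The record case can be checked the same way. This is a minimal sketch using the standard `erl_scan` module; the module name `record_scan_demo` is made up for the example. It only shows that today's tokenizer produces the same three tokens with or without white-space.

```erlang
-module(record_scan_demo).
-export([tokens/1]).

%% Tokenize a source fragment; locations are plain line numbers.
tokens(Source) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(Source),
    Tokens.
```

Both `record_scan_demo:tokens("#bf{")` and `record_scan_demo:tokens("# bf {")` return `[{'#',1},{atom,1,bf},{'{',1}]`, which is why a single record token would have to tolerate white-space to stay compatible.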

For other reasons, namely that function call parenthesis are optional,
Elixir has chosen to use the «`~`» character as the start of
Elixir has chosen to use the `~` character as the start of
a string prefix which they call a "[Sigil][1]".

Having a distinct start character for this feature simplifies
@@ -139,39 +139,39 @@ tokenizing and parsing.
### Sigil

In a general sense, a [Sigil][3], is a prefix to a variable
that indicates its *type*, such as «`$I`» in Basic or Perl,
where «`$`» is the sigil and «`I`» is the variable.
that indicates its *type*, such as `$I` in Basic or Perl,
where `$` is the sigil and `I` is the variable.

Here we define a Sigil as a prefix (and a suffix) to a string literal
that indicates how it should be *interpreted*. The Sigil is
a *syntactic sugar* that creates some Erlang term.

A Sigil string literal consists of:

1. The [Sigil Prefix][], «`~`» followed by a name that may be empty.
1. The [Sigil Prefix][], `~` followed by a name that may be empty.
2. The [String Content][] within [String Delimiters][].
3. The [Sigil Suffix][], a name character sequence that may be empty.

A Sigil looks like a string with a prefix (and maybe a suffix),
but expands to some term (or expression), so it cannot be subject
to the string concatenation the parser does.

Therefore `"abc" "def"` is `"abcdef"` but `~s"abc" "def"`
Therefore «`"abc" "def"`» is `"abcdef"` but «`~s"abc" "def"`»
should be illegal, and also all other sequences consisting
of a Sigil of any type, and any other term, in any order.

### Sigil Prefix

The Sigil Prefix starts whith the Tilde character «`~`», followed
The Sigil Prefix starts whith the Tilde character `~`, followed
by the Sigil Type which is a name composed of a sequence of characters
that are allowed as the second or later characters in a variable or an atom.
In short ISO [Latin-1][] letters, digits, «`_`» and «`@`».
In short ISO [Latin-1][] letters, digits, `_` and `@`.
The Sigil Type may be empty.

The Sigil Type defines how the [Sigil][] syntactic sugar
shall be interpreted. The suggested Sigil Types are:

* «»: the vanilla (default) [Sigil][].
* «»: the vanilla (default (empty name)) [Sigil][].

Creates an Erlang `unicode:unicode_binary()`.
It is a string represented as a UTF-8 encoded binary,
@@ -191,44 +191,44 @@ shall be interpreted. The suggested Sigil Types are:
the first and most desired missing string feature in Erlang.
This sigil does just that.

* «`b`»: `unicode:unicode_binary()`
* `b`: `unicode:unicode_binary()`

Creates a UTF-8 encoded binary, handling escape characters
in the string content. Other features such as string interpolation
will require another Sigil Type or using the [Sigil Suffix][].

In Elixir this corresponds to the «`~s`» sigil, a [string][4].
In Elixir this corresponds to the `~s` sigil, a [string][4].

* «`B`»: `unicode:unicode_binary()`, verbatim.
* `B`: `unicode:unicode_binary()`, verbatim.

Creates a UTF-8 encoded binary, with verbatim string content.
The content ends when the end delimiter is found.
There is no way to escape the end delimiter.

In Elixir this corresponds to the «`~S`» sigil, a [string][4].
In Elixir this corresponds to the `~S` sigil, a [string][4].

* «`s`»: `string()`.
* `s`: `string()`.

Creates a Unicode codepoint list, handling escape characters
in the string content. Other features such as string interpolation
will require another Sigil Type or using the [Sigil Suffix][].

In Elixir this corresponds to the «`~c`» sigil, a [charlist][5].
In Elixir this corresponds to the `~c` sigil, a [charlist][5].

* «`S`»: `string()`, verbatim.
* `S`: `string()`, verbatim.

Creates a Unicode codepoint list, with verbatim string content.
The content ends when the end delimiter is found.
There is no way to escape the end delimiter.

In Elixir this corresponds to the «`~C`» sigil, a [charlist][5].
In Elixir this corresponds to the `~C` sigil, a [charlist][5].

* «`R`»: regular expression.
* `R`: regular expression.

This EEP proposes to not implement regular expressions yet.
It is still unclear how integration with the `re` module
should be done, and if it is worth the effort compared
to just using the «`S`» or the «`B`» Sigil Type.
to just using the `S` or the `B` Sigil Type.

The best idea so far was that this sigil creates a term
`{re,RE::unicode:charlist(),Flags::[unicode:latin1_char()]}`
@@ -245,7 +245,7 @@ shall be interpreted. The suggested Sigil Types are:
the regular expression rules.

The main advantage of a regular expression [Sigil][] is to avoid
the additional escaping of «`\`» that regular erlang strings require.
the additional escaping of `\` that regular erlang strings require.

Today: `re:run(Subject, "^\\s*\"[a-z]+\\\\\\d+\"", [caseless,unicode])`

@@ -264,9 +264,9 @@ since they are often a source for hard to find problems.

These proposed Sigil Types are named according to the corresponding
Erlang types. The Sigil Types in [Elixir][1] are named according to
Elixir types. So, for example, a «`~s`» Sigil Prefix in Erlang
Elixir types. So, for example, a `~s` Sigil Prefix in Erlang
creates an Erlang `string()`, which is a list of Unicode codepoints,
but in Elixir the «`~s`» Sigil Prefix creates an Elixir [String][4]
but in Elixir the `~s` Sigil Prefix creates an Elixir [String][4]
which is a UTF-8 encoded binary.
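
A minimal sketch of what the string and binary Sigil Types produce, assuming a compiler that supports sigils (Erlang/OTP 27 or later); it will not compile on older releases. The module name `sigil_types_demo` and the function `check/0` are hypothetical.

```erlang
-module(sigil_types_demo).
-export([check/0]).

%% Each match succeeds only if the sigil creates the term on the left.
check() ->
    %% Vanilla sigil: a UTF-8 encoded binary.
    <<"abc">> = ~"abc",
    %% ~b: binary, escape sequences are translated (\n is ASCII 10).
    <<"a\nb">> = ~b"a\nb",
    %% ~B: binary, verbatim (backslash and n stay two characters).
    <<"a\\nb">> = ~B"a\nb",
    %% ~s: list of codepoints, escape sequences are translated.
    "a\nb" = ~s"a\nb",
    %% ~S: list of codepoints, verbatim.
    "a\\nb" = ~S"a\nb",
    ok.
```

Calling `sigil_types_demo:check()` returns `ok` when every match holds. In Elixir the same `~s` prefix would instead give a binary, which is the naming difference described above.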

Consistency within the language is supposedly more important
@@ -280,13 +280,13 @@ A specific start delimiter character has a corresponding
end delimiter character.

The allowed start-end delimiter character pairs are:
«`()`», «`[]`», «`{}`», «`<>`» and «`«»`».
`( ) [ ] { } < > « »`.

The following characters are start delimiters that have themselves
as end delimiters: «`/`», «`|`», «`'`», «`"`» and «`#`».
as end delimiters: `/ | ' " #`.

Triple-quote delimiters are also allowed, that is; a sequence of
3 or more double quote «`"`» characters as described in [EEP 64][].
3 or more double quote `"` characters as described in [EEP 64][].

For a given [Sigil Type][] except the [Vanilla Sigil][],
which String Delimiters that are used does not affect how
@@ -297,20 +297,19 @@ doesn't occur in the string's content, so interpreting the string content
does not interfere with finding the end delimiter.
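
The following sketch illustrates the delimiter choice, again assuming Erlang/OTP 27 or later. It uses only delimiters from the lists above that are known to be accepted there (the `« »` pair is deliberately left out), and the module name `sigil_delimiter_demo` is hypothetical.

```erlang
-module(sigil_delimiter_demo).
-export([check/0]).

check() ->
    %% The same content within different delimiters gives the same term.
    "abc" = ~s(abc),
    "abc" = ~s[abc],
    "abc" = ~s{abc},
    "abc" = ~s<abc>,
    "abc" = ~s/abc/,
    "abc" = ~s|abc|,
    %% A verbatim sigil with a delimiter that is absent from the content
    %% avoids the extra escaping a regular string needs.
    "^\\s*\"[a-z]+\\\\\\d+\"" = ~S/^\s*"[a-z]+\\\d+"/,
    ok.
```

The last match shows the practical benefit: the content contains `"` and `\` but no `/`, so with `/` as delimiter nothing has to be escaped.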

The proposed set of delimiters is the same as in [Elixir][1],
plus «`«»`» and «`#`». They are the characters in [Latin-1][]
plus `« »` and `#`. They are the characters in [Latin-1][]
that are normally used for bracketing or text quoting,
and those that feel like full height vertikal lines.
Except: «`\`» is too often used for character escaping,
«`` `» and «`´`» look too much like «`'`»,
«`¦`» looks too much like «`|`», and «`#`» is too useful
to *not* include since it in many contexts (shell scripts,
Perl regular expressions) it is a comment character than
is easy to avoid in the [String Content][].

It may not be obvious how to type the «`«`» and «`»`» characters
and those that feel like full height vertikal lines,
except: `\` is too often used for character escaping,
`` ` `` and `´` look too much like `'`, `¦` looks too much like `|`,
and `#` is too useful to *not* include since it in many contexts
(shell scripts, Perl regular expressions) it is a comment character
that is easy to avoid in the [String Content][].

It may not be obvious how to type the `«` and `»` characters
on some keyboards (US), but there *are* ways that should not
hinder a determined programmer. When using X Compose sequences
it is simply [`Compose`] [`<`] [`<`] and [`Compose`] [`>`] [`>`].
discourage a determined programmer. When using X Compose sequences
it is simply [Compose] [<] [<] and [Compose] [>] [>].

### String Content

@@ -322,7 +321,7 @@ of indentation and leading and trailing newline is done as usual
as described in [EEP 64][].

In a string with single character [String Delimiters][],
normal Erlang escape sequences prefixed with «`\`» are honoured,
normal Erlang escape sequences prefixed with `\` are honoured,
as usual for regular Erlang strings and quoted atoms

A specific [Sigil Type][] can have it's own character escaping rules,
@@ -338,10 +337,10 @@ of name characters.

The Sigil Suffix may indicate how to interpret the String Content,
for a specific [Sigil Type][].
For example; for the «`~R`» [Sigil Prefix][] (regular expression),
For example; for the `~R` [Sigil Prefix][] (regular expression),
the Sigil Suffix is interpreted as short form compile options
such as «`i`» that makes the regular expression character
case insensitive. For example `~R/^from: /i`.
case insensitive. For example «`~R/^from: /i`».

Things that may have to be performed by the tokenizer, such as
how to handle escape character rules, should not be affected
@@ -419,38 +418,37 @@ should represent an *uncompiled* regular expression with compile flags.

### Comparison with Elixir

The [Vanilla Sigil][] (empty [Sigil Type][]) is not allowed in Elixir.
There is no [Vanilla Sigil][] (empty [Sigil Type][]) in Elixir.

This EEP proposes to add the following [String Delimiters][]
to the set that Elixir has: «`«»`» and «`#`».
to the set that Elixir has: `« » #`.

The string and binary [Sigil Type][]s are named differently
between the languages, to keep the names consistent within
the language (Erlang): «`~s`» in Elixir is «`~b`» in Erlang,
and «`~c`» in Elixir is «`~s`» in Erlang, so «`~s`» means
the language (Erlang): `~s` in Elixir is `~b` in Erlang,
and `~c` in Elixir is `~s` in Erlang, so `~s` means
different things, because strings are different things.

When Elixir allows escape sequences in the [String Content][]
it also allows string interpolation. This EEP proposes to *not*
implement string interpolation in the suggested [Sigil Type][]s.


When Elixir doesn't allow escape sequences in the [String Content][],
it still allows escaping the end delimiter. This EEP proposes
that such strings should be truly verbatim whith no possibility
to escape the end delimiter.

There are small differences in which escape sequences that are implemented
in the languages; Elixir allows escaping of newlines, and has
an escape sequence «`\a`», that Erlang does not have.
an escape sequence `\a`, that Erlang does not have.

There are also small differences in how newlines are handled
between «`~S`» heredocs in Elixir and triple-quoted strings in Erlang.
between `~S` heredocs in Elixir and triple-quoted strings in Erlang.
See [EEP 64][].

Details about regular expression sigils, «`~R`», in particular
Details about regular expression sigils, `~R`, in particular
their [Sigil Suffix][]es remains to be decided in Erlang.
Also, there is a question about escaping the end delimiter or not.
Also, there still is a question about escaping the end delimiter or not.

It has not been decided how or even *if* string interpolation
will be implemented in Erlang, but a [Sigil Suffix][] or
@@ -459,8 +457,8 @@ new [Sigil Type][]s would most probably be used.
Reference Implementation
------------------------

[PR-7684][] Implements the «`s`», «`S`», «`b`», «`B`»
and the «``» (vanilla) Sigil, according to this EEP.
[PR-7684][] Implements the `~s`, `~S`, `~b`, `~B`
and the `~` (vanilla) Sigil, according to this EEP.

The tokenizer produces a `sigil_prefix` token before the string literal,
and a `sigil_suffix` token after. The parser merges and transforms them