uniscript
Uniscript is a human-readable and editable Unicode encoding format that uses only ASCII characters to describe code points.
The constituents of uniscript are entities and block types.
Block types influencing the character stream would be
• languages (greek a => α)
• modifiers (upper A => ᴬ , italic A => 𝐴 , bold A => 𝝖 , bold+italic A => 𝘼 … )
• calligraphic hands (fracture A => 𝔄 , double-struck A => 𝔸 … )
• ligature (ligature ae => æ )
• colors (red circle ○ => 🔴, brown heart ♡ => 🤎)
• mirroring (reverseInPlace e => ɘ )
• text direction (phoenician a b c => 𐤂 𐤁 𐤀 )
• icons (iconic warning ⚠ =>
Uniscript entities are case sensitive
upper a => ᵃ upper A => ᴬ
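As a rough illustration of how a block type could act on single letters, the sketch below leans on Python's `unicodedata` module and looks characters up by their official Unicode names. The `TEMPLATES` table and the `apply_block` helper are assumptions made for this example and are not part of uniscript; only three of the block types listed above are covered.

```python
import unicodedata

# Assumed name templates per block type (hypothetical, not from a uniscript spec).
# The second template covers letters that live outside the mathematical block.
TEMPLATES = {
    "fracture": ["MATHEMATICAL FRAKTUR {case} {letter}", "BLACK-LETTER {case} {letter}"],
    "double-struck": ["MATHEMATICAL DOUBLE-STRUCK {case} {letter}", "DOUBLE-STRUCK {case} {letter}"],
    "upper": ["MODIFIER LETTER {case} {letter}"],
}

def apply_block(block: str, ch: str) -> str:
    """Map one ASCII letter through a block type; other characters pass through."""
    if not (ch.isascii() and ch.isalpha()):
        return ch
    # Case sensitivity: 'a' and 'A' select different Unicode names.
    case = "CAPITAL" if ch.isupper() else "SMALL"
    for template in TEMPLATES[block]:
        try:
            return unicodedata.lookup(template.format(case=case, letter=ch.upper()))
        except KeyError:
            continue  # no character with that name, try the next template
    return ch

print(apply_block("fracture", "A"))  # 𝔄
print(apply_block("upper", "a"))     # ᵃ
print(apply_block("upper", "A"))     # ᴬ
```

Looking names up at run time keeps the table small, but a real mapping file would probably list every pair explicitly.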
The textual representation of entities and blocks in Uniscript.
Simple entities can be represented as \: followed by the entity name:
\:infinity == ∞
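A minimal sketch of expanding the simple `\:name` form could look like the following. The `ENTITIES` table, the `expand_simple` helper and the exact set of characters allowed in a name are assumptions for illustration; the real names would come from the mapping file.

```python
import re

# Hypothetical miniature entity table; the real one would live in the mapping file.
ENTITIES = {"infinity": "∞", "alpha": "α", "beta": "β"}

# Assumption: a name starts with an ASCII letter and continues with letters, digits or hyphens.
SIMPLE = re.compile(r"\\:([A-Za-z][A-Za-z0-9-]*)")

def expand_simple(text: str) -> str:
    """Replace each backslash-colon entity with its character; unknown names stay untouched."""
    return SIMPLE.sub(lambda m: ENTITIES.get(m.group(1), m.group(0)), text)

print(expand_simple(r"x -> \:infinity"))  # x -> ∞
```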
The essential marker for the beginning of complex uniscript elements is "<:".
Entities are wrapped either in a single bracket of the form <:entity> or in a block of the form <:block> entities <:/block>. For short sequences of entities there is an inline delineation: <:type entities>
<:alpha> ⩵ α
<:fracture A> ⩵ 𝔄
<:fracture A b c > ⩵ 𝔄 𝔟 𝔠
<:fracture> A b c <:> ⩵ 𝔄 𝔟 𝔠
<:greek> a b c <:/greek> ⩵ α β ζ
Blocks are closed by repeating the opening type plus a slash:
<:greek> a b c <:/greek> ⩵ α β ζ
To support interoperability with xml/html the colon in <:/greek> must NOT be omitted!
All spaces surrounding entities are only for visual appeal, are not part of the codepoint stream and will thus not be rendered in the resulting UTF-8 representation.
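The three bracket forms above can be sketched as a small renderer. Everything named here (`ENTITIES`, `BLOCKS`, `convert`, `render`) is hypothetical, and the tables only hold a handful of entries. The sketch assumes that whitespace inside a block or inline bracket is dropped while text outside any bracket is copied verbatim, which is one possible reading of the spacing rule.

```python
import re

# Hypothetical miniature tables; the real ones would come from the mapping file.
ENTITIES = {"alpha": "α", "beta": "β", "less": "<"}
BLOCKS = {
    "greek": {"a": "α", "b": "β"},
    "fracture": {"A": "𝔄", "b": "𝔟", "c": "𝔠"},
}

# A control element starts with "<:" and runs to the next ">".
TAG = re.compile(r"<:([^<>]*)>")

def convert(chars: str, block: str) -> str:
    """Map characters through a block table, dropping whitespace (assumption)."""
    table = BLOCKS[block]
    return "".join(table.get(c, c) for c in chars if not c.isspace())

def render(source: str) -> str:
    out, pos, current = [], 0, None
    for m in TAG.finditer(source):
        text = source[pos:m.start()]
        out.append(convert(text, current) if current else text)
        pos = m.end()
        body = m.group(1).strip()
        head, _, rest = body.partition(" ")
        if body == "" or body.startswith("/"):
            current = None                    # <:> or <:/type> closes the block
        elif body in ENTITIES:
            out.append(ENTITIES[body])        # <:entity>
        elif head in BLOCKS and rest:
            out.append(convert(rest, head))   # inline form <:type entities>
        elif body in BLOCKS:
            current = body                    # <:type> opens a block
        else:
            out.append(m.group(0))            # unknown element: leave as written
    tail = source[pos:]
    out.append(convert(tail, current) if current else tail)
    return "".join(out)

print(render("<:fracture> A b c <:>"))  # 𝔄𝔟𝔠
print(render("<:alpha> > <:beta>"))     # α > β
```

Because only the sequence "<:" opens a control element, the free-standing ">" in the second call needs no special handling.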
Unicode and fonts have conceptual overlap in font faces such as bold and italic, but there are also fonts rendering a normal A as fracture 𝔄.
In an ideal world there would be a cleaner separation between Unicode entities and visual variants. This is unfortunately out of scope. There is a tiny chance that uniscript could stop or even undo the proliferation of codepoints [1] such as ♡ => 🤎 by adding colors as Unicode control characters instead of arbitrarily combining a select number of entities with a select number of colors.
[1] https://en.wikipedia.org/wiki/Unicode_control_characters
Likewise one might reinvestigate clusters such as ⚠ =>
Emoji-style U+FE0F control characters are fine in principle, but should be prefixed to the following character, not suffixed.
On the other hand
IDEs may render these brackets beautifully as
⟨alpha⟩ ⩵ α
⟨fracture A⟩ ⩵ 𝔄
⟪greek⟫ a b c ⟪/greek⟫ ⩵ α β ζ
WHY THOUGH?
List of block types:
⟪ligature⟫ ⟪fracture⟫
All entity mapping shall be defined in one human readable mapping file, which hopefully will one day evolve into a standard. Custom entity names may be defined in an extension file.
In general, overlap between entity names and type names can be intentionally ambiguous yet yield the same result:
<:double> d <:> block type marker 'double' influencing all characters, in this case 'd' => 𝕕
<:double d> block type 'double' or entity 'double d'? Irrelevant for users, the result is '𝕕'
<:double-d> one may write entity names unambiguously with hyphens.
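One way to picture this harmless ambiguity is to evaluate both readings of a bracket body and accept it whenever they agree. The tables and the helpers `as_entity`, `as_block` and `resolve` are hypothetical names for this sketch, which also assumes that the hyphenated form is the canonical entity key.

```python
# Hypothetical tables; canonical entity names are assumed to be hyphenated.
ENTITIES = {"double-d": "𝕕"}
BLOCKS = {"double": {"d": "𝕕"}}

def as_entity(body: str):
    """Read the whole body as one entity name ('double d' -> 'double-d')."""
    return ENTITIES.get("-".join(body.split()))

def as_block(body: str):
    """Read the first word as a block type applied to the remaining characters."""
    parts = body.split()
    if len(parts) < 2 or parts[0] not in BLOCKS:
        return None
    table = BLOCKS[parts[0]]
    return "".join(table.get(c, c) for c in "".join(parts[1:]))

def resolve(body: str) -> str:
    readings = {r for r in (as_entity(body), as_block(body)) if r is not None}
    if len(readings) != 1:
        raise ValueError(f"unknown or conflicting element: {body!r}")
    return readings.pop()

print(resolve("double d"))  # 𝕕 (both readings agree)
print(resolve("double-d"))  # 𝕕 (entity reading only)
```

A mapping file could run a check like this over all names to make sure no two readings ever disagree.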
Alternative names for uniscript considered but rejected (not ultimately?) were: unitext plaincode plain-code pluni-code plunicode.
Not to be confused with UTF-7.
An alternative format with the same concepts of entities and block types could be considered:
:alpha :fracture Hello :
Reinvigorating and extending the HTML entity encoding format would also be possible, so that it encompasses the comprehensive list of Unicode codepoint entities with English names plus the block type modifiers declared above.
&ligature; ae &end-ligature;
So when using uniscript the encoding always needs to be explicit.
For example, in the future instead of tagging web pages with One might use .
All texts containing "<:" as a character sequence not intended as a control signal need to encode it (similar to & within HTML entities).
One proper encoding of "<:" would be <:less>: or <:<> or <<:colon> or <<::>
Usually free standing "<" characters need NOT be encoded as <:less> because only the combination of "<:" forms a uniscript control signal. Likewise the character ">" NEVER needs to be encoded as <:greater> because ">" does not influence the unicode control flow except as closing entity/block marker AFTER the "<:" marker.
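The escaping rule can be sketched in a few lines. `escape` and `unescape_less` are hypothetical helper names, and the sketch picks the `<:less>:` spelling from the alternatives listed above; the decoder only undoes that one entity and is not a full uniscript parser.

```python
def escape(text: str) -> str:
    """Encode every literal "<:" so it is not mistaken for a control signal."""
    return text.replace("<:", "<:less>:")

def unescape_less(text: str) -> str:
    """Decode only the <:less> entity (enough to check the round trip)."""
    return text.replace("<:less>", "<")

sample = "if a <: b and c > d"
encoded = escape(sample)
print(encoded)                           # if a <:less>: b and c > d
print(unescape_less(encoded) == sample)  # True
print(escape("a < b > c"))               # unchanged: no "<:" sequence present
```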
Since entity names are ascii only, there is no difficulty in parsing <:alpha> > <:beta> as α > β
Entities are similar to HTML entities but use a different encoding: <:alpha> vs &alpha;
HTML entities with cryptic names (such as &dopf; for 𝕕) are supported for backwards compatibility but are strongly discouraged. Uniscript entities are much more comprehensive, and all cryptic abbreviations have one or more equivalent descriptive long English entity names. For example, 𝕕 (&dopf;) has the entity name <:double d>
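For the backwards-compatible HTML names, a converter written in Python would not need its own table, since the standard library's `html.unescape` already knows the HTML5 named entities. The helper name `expand_legacy` and the decision to accept only semicolon-terminated names are assumptions of this sketch.

```python
import html
import re

# Assumption: only semicolon-terminated named entities are treated as legacy entities.
LEGACY = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")

def expand_legacy(text: str) -> str:
    """Expand HTML named entities; unknown names are left as written."""
    return LEGACY.sub(lambda m: html.unescape(m.group(0)), text)

print(expand_legacy("&alpha; and &dopf;"))  # α and 𝕕
print(expand_legacy("&nosuchname;"))        # &nosuchname;
```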
⟪ U+027EA ⟪ entity
⟫ U+027EB ⟫ entity
⟨ U+027E8 ⟨ entity
⟩ U+027E9 ⟩ entity
&fr; &fracture; &opf; &???; 𝕕 U+1D555 𝕕 entity
&DoubleType; 𝕕 ¨ ⇓ …
À U+000C0 À entity
à U+000E0 à entity
ã U+000E3 ã entity
≔ U+02254 ≔ entity
* U+0002A * entity
∧ U+02227 ∧ entity
∠ U+02220 ∠ entity
æ U+000E6 æ entity
ℵ U+02135 ℵ entity
α U+003B1 α entity
∵ U+02235 ∵ entity
⨀ U+02A00 ⨀ entity
⨁ U+02A01 ⨁ entity
⨂ U+02A02 ⨂ entity
⨆ U+02A06 ⨆ entity
★ U+02605 ★ entity
⋁ U+022C1 ⋁ entity
⋀ U+022C0 ⋀ entity
█ U+02588 █ entity
▪ U+025AA ▪ entity
▴ U+025B4 ▴ entity
␣ U+02423 ␣ entity NOT BLANK;)
⊥ U+022A5 ⊥ entity
• U+02022 • entity
· U+000B7 · entity
✓ U+02713 ✓ entity
✓ U+02713 ✓ entity
⊖ U+02296 ⊖ entity
⊕ U+02295 ⊕ entity
⊗ U+02297 ⊗ entity
♣ U+02663 ♣ entity
♣ U+02663 ♣ entity
∷ U+02237 ∷ entity
: U+0003A : entity
© U+000A9 © entity
⨯ U+02A2F ⨯ entity
∪ U+0222A ∪ entity
‐ U+02010 ‐ entity
° U+000B0 ° entity
⋄ U+022C4 ⋄ entity
♦ U+02666 ♦ entity
÷ U+000F7 ÷ entity
÷ U+000F7 ÷ entity
$ U+00024 $ entity
¨ U+000A8 ¨ entity
⇓ U+021D3 ⇓ entity
¨ U+000A8 ¨ entity
˙ U+002D9 ˙ entity
↓ U+02193 ↓ entity
ð U+000F0 ð entity
∃ U+02203 ∃ entity
∃ U+02203 ∃ entity
∀ U+02200 ∀ entity
½ U+000BD ½ entity
½ U+000BD ½ entity …
♥ U+02665 ♥ entity
♥ U+02665 ♥ entity
‐ U+02010 ‐ entity
∈ U+02208 ∈ entity
∫ U+0222B ∫ entity
U+02062 entity
κ U+003BA κ entity
λ U+003BB λ entity
open questions :
• Should partial entity names be completed by the IDE or also be allowed in uniscript :nat :hyph :alp ?
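To make the trade-off in this open question concrete, here is a sketch of prefix completion as an IDE (or a lenient parser) might perform it. The `ENTITY_NAMES` list and the `complete` helper are hypothetical; the point is that a partial name is only safe while exactly one full name starts with it.

```python
# Hypothetical entity names; a real list would come from the mapping file.
ENTITY_NAMES = ["alpha", "aleph", "natural", "hyphen"]

def complete(prefix: str, names=ENTITY_NAMES):
    """Return the unique full name for a prefix, or None if unknown or ambiguous."""
    if prefix in names:
        return prefix  # an exact name always wins
    matches = [n for n in names if n.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None

print(complete("nat"))   # natural
print(complete("hyph"))  # hyphen
print(complete("alp"))   # alpha
print(complete("al"))    # None: both 'alpha' and 'aleph' match
```

Allowing partial names in stored text would mean that adding a new entity name can silently make old documents ambiguous, which may be an argument for keeping completion in the IDE.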