Merge branch 'main' of github.com:propensive/kaleidoscope

propensive · Jan 9, 2025 · 14f3cea
2 parents 5ff35bb + 5d3de60
commit 14f3cea
Showing 1 changed file with 81 additions and 12 deletions.
diff --git a/.github/readme.md b/.github/readme.md
@@ -88,12 +88,12 @@ def validate(email: Text): Optional[Text] = email match
 ```
 
 Such patterns will either match or not; should they match, however, it is
-possible to extract parts of the matched string using capturing groups. The
+possible to extract parts of the matched string using _capturing groups_. The
 pattern syntax is exactly as described in the [Java Standard
 Library](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html),
 with the exception that a capturing group (enclosed within `(` and `)`) may be
 bound to an identifier by placing it, like an interpolated string substitution,
-immediately prior to the capturing group, as `$identifier` or `${identifier}`.
+immediately before the capturing group, as `$identifier` or `${identifier}`.
 
 Here is an example of using a pattern match against filenames:
 ```scala
@@ -106,8 +106,8 @@ def identify(path: Text): FileType = path match
   case r"/styles/$styles(.*)" => FileType.Stylesheet(styles)
 ```
 
-Alternatively, as with patterns in general, this can be extracted directly in a
-`val` definition.
+Alternatively, as with patterns in general, these can be extracted directly in a
+`val` definition (though this is less common usage).
 
 Here is an example of matching an email address:
 ```scala
@@ -133,9 +133,11 @@ compile-time, and any issues will be reported then.
 A normal, _unitary_ capturing group, like `domain` and `tld` above, will
 extract into `Text` values. But if a capturing group has a repetition suffix,
 such as `*` or `+`, then the extracted type will be a `List[Text]`. This also
-applies to repetition ranges, such as `{3}`, `{2,}` or `{1,9}`.
+applies to repetition ranges, such as `{3}` (exactly three times), `{2,}`
+(at least twice) or `{1,9}` (at least once, and at most nine times). Note that `{1}`
+will still extract a `Text` value, not a `List[Text]`.
 
-Note that `{1}` will still extract a `Text` value. The type is determined
+The type of each captured group is determined
 statically from the pattern, and not dynamically from the runtime scrutinee.
 
 A capture group may be marked as optional, meaning it can appear either zero or
@@ -144,30 +146,97 @@ if it is present it will be a `Text` value, and if not, it will be `Unset`.
 
 For example, see how `init` is extracted as a `List[Text]`, below:
 ```scala
-import gossamer.{drop, Rtl}
+import gossamer.{skip, Rtl}
 
 def parseList(): List[Text] = "parsley, sage, rosemary, and thyme" match
   case r"$only([a-z]+)"                      => List(only)
   case r"$first([a-z]+) and $second([a-z]+)" => List(first, second)
-  case r"$init([a-z]+, )*and $last([a-z]+)"  => init.map(_.drop(2, Rtl)) :+ last
+  case r"$init([a-z]+, )*and $last([a-z]+)"  => init.map(_.skip(2, Rtl)) :+ last
 ```
 
 ### Escaping
 
-Note that inside an extractor pattern string, whether it is single- (`r"..."`)
-or triple-quoted (`r"""..."""`), special characters, notably `\`, do not need
-to be escaped, with the exception of `$` which should be written as `$$`.
+Escaping happens at two levels between source code and regular expression:
+first when the source code is interpreted as a string, and again when that
+string is interpreted as a regular expression pattern.
+
+This is particularly apparent when pattern matching a single backslash (`\`) in Java:
+we must write `java.util.regex.Pattern.compile("\\\\")`. The backslash in the
+regular expression needs to be escaped with another backslash; then _each_ of those
+backslashes must be escaped in order to embed them in a string.
+
+The situation is improved in Kaleidoscope patterns, written as single- (`r"..."`)
+or triple-quoted (`r"""..."""`) interpolated strings: special characters,
+notably `\`, do _not_ need to be escaped. This means that only the regular expression
+escaping rules need to be considered.
+
+The one exception is `$`, which indicates a substitution and must be
+written as `$$`.
 
 It is still necessary, however, to follow the regular expression escaping
 rules; for example, an extractor matching a single opening parenthesis would be
 written as `r"\("` or `r"""\("""`.
 
+### `Regex` values
+
+Regular expressions may be defined as values outside of pattern matches, using the
+same syntax. For example,
+```scala
+val IpAddress = r"([0-9]+\.){3}[0-9]+"
+```
+which may then be used in a pattern,
+```scala
+input match
+  case IpAddress(addr) => addr
+```
+but can only represent an entire match, and cannot extract capturing groups.
+
+Such values are instances of `Regex` and provide access to the source pattern as a
+`Text` value, `Regex#pattern`, as well as the position and nature of any groups
+within the pattern.
+
+`Regex`s are used in other Soundness libraries wherever a regular expression is
+required. More importantly, `Text` is _never_ used for a value that
+represents a regular expression. So it's possible to know from a method's
+signature whether its parameter is interpreted as a regular expression or as
+a direct string.
+
+Indeed, in Gossamer, the method `sub` (for making substitutions in a textual
+value) is overloaded to take either a `Regex` or a `Text` parameter, and to behave
+accordingly.
+
 ## Globs
 
-Globs offer a simplified and limited form of regular expression. You can use
+Globs offer a simplified but limited form of regular expression. You can use
 these in exactly the same way as a standard regular expression, using the
 `g"..."` interpolator instead.
 
+For example,
+```scala
+path match
+  case g"/usr/local/bin/$name"     => name
+  case g"/home/*/.local/bin/$name" => name
+```
+
+The appearance of a `*` in a glob will match any sequence of characters,
+except for `/` and `\`. A `?` will match exactly one character. An extractor,
+such as `$name` above, is equivalent to `*`, but binds the value to an identifier.
+
+As an expression (rather than a pattern), an interpolated string, `g""`, will
+also produce a `Regex` value, and can be used anywhere a `Regex` is valid. In
+fact, globs are implemented as a simpler front-end to regular expressions. So it
+would be possible to write `path.sub(g"/home/*/.local", t"/usr/local")` _or_
+`path.sub(r"/home/[^/\\]*/\.local", t"/usr/local")` to achieve the same goal.
+
+### Equality
+
+It is even possible, sometimes, to equate regular expressions and globs. For
+example, `g".local/*" == r"\.local/[^/\\]*"` returns `true` because they are
+represented by identical underlying patterns. However, the inequality of two
+`Regex` instances does not necessarily indicate any difference in their
+behavior: it may be impossible to find any input where one matches and the
+other does not, even though their patterns differ. (As a trivial example,
+consider `r"[xy]"` and `r"[yx]"`: non-equal `Regex`s, with identical behavior.)
 
 
 ## Status
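
Reviewer note on the Escaping and Globs hunks above: the two claims that are easiest to misread (double escaping in plain Java strings, and `*` in a glob standing for `[^/\\]*`) can be checked with a minimal sketch that uses only `java.util.regex` from the standard library. It does not use Kaleidoscope itself. The names `escapedTwice`, `escapedOnce`, `matchesWhole` and `globToRegex` are hypothetical, and the translation rules in `globToRegex` (in particular mapping `?` to `.`) are assumptions made for illustration, not the library's implementation.

```scala
import java.util.regex.Pattern

// Two levels of escaping: the regular expression needs \\ to match a single
// backslash, and an ordinary string literal needs each of those doubled again.
val escapedTwice: Pattern = Pattern.compile("\\\\")

// A triple-quoted Scala string removes the source-level escaping, so only the
// regular-expression rules remain (the situation the readme describes for r"...").
val escapedOnce: Pattern = Pattern.compile("""\\""")

// True only when the whole input matches the pattern.
def matchesWhole(pattern: Pattern, input: String): Boolean =
  pattern.matcher(input).matches()

// Hypothetical glob translation, assumed for illustration only:
//   *  becomes  [^/\\]*  (any run of characters except / and \)
//   ?  becomes  .        (assumption: any single character)
// Regex metacharacters are escaped; everything else is copied unchanged.
def globToRegex(glob: String): String =
  glob.toList.map {
    case '*' => """[^/\\]*"""
    case '?' => "."
    case c if """\.[]{}()+^$|""".contains(c) => "\\" + c
    case c => c.toString
  }.mkString
```

Under these assumptions, `globToRegex(".local/*")` produces the same pattern text as the `r"\.local/[^/\\]*"` example in the Equality section, which is the sense in which the readme says the two are represented by identical underlying patterns.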