Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Wed Jul 19 05:53:14 EDT 2017


Chris Angelico <rosuav at gmail.com>:

> To be quite honest, I wouldn't care about that possibility. If I could
> design regex semantics purely from an idealistic POV, I would say that
> [xyzã], regardless of its encoding, will match any of the four
> characters "x", "y", "z", "ã".
>
> Earlier I posted a suggestion that a folding function be used when
> searching (for instance, it can case fold, NFKC normalize, etc).
> Unfortunately, this makes positional matching extremely tricky; if
> normalization changes the number of code points in the string, you
> have some fiddly work to do to try to find back the match location in
> the original (pre-folding) string. That technique works well for
> simple lookups (eg "find me all documents whose titles contain <this
> string>"), but a regex does more than that. As such, I am in favour of
> the regex engine defining a "character" as a base with all subsequent
> combining, so a single dot will match the entire combined character,
> and square bracketed expressions have the same meaning whether you're
> NFC or NFD normalized, or not normalized. However, that's the ideal
> situation, and I'm not sure (a) whether it's even practical to do
> that, and (b) how bad it would be in terms of backward compatibility.
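
To make the quoted problem concrete, here is a small sketch using only the
standard re and unicodedata modules. Python's re matches code point by code
point, so the same visible character behaves differently depending on its
normalization form. The variable names are mine, chosen for illustration.

```python
import re
import unicodedata

nfc = unicodedata.normalize("NFC", "a\u0303")   # U+00E3, one code point
nfd = unicodedata.normalize("NFD", "\u00e3")    # U+0061 U+0303, two code points
pattern = "[xyz\u00e3]"                          # the class [xyzã], written in NFC

print(len(nfc), len(nfd))                        # 1 2

# The character class matches exactly one code point, so only the NFC form fits.
matches_nfc = re.fullmatch(pattern, nfc) is not None
matches_nfd = re.fullmatch(pattern, nfd) is not None
print(matches_nfc, matches_nfd)                  # True False

# A single dot likewise covers only the base letter of the NFD form.
dot_matches_nfd = re.fullmatch(".", nfd) is not None
print(dot_matches_nfd)                           # False
```

Normalizing both the pattern and the subject to the same form hides the
difference, at the cost of the position bookkeeping described in the quote.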

Here's a proposal:

   * introduce a builtin (predefined) class Text

   * conceptually, a Text object is a sequence of "real" characters

   * you can access each "real" character by its position in O(1)

   * the "real" character is defined to be an integer computed as follows
     (in pseudo-Python):

      string = the NFC normal form of the real character as a string
      rc = 0
      shift = 0
      for codepoint in string:
          rc |= ord(codepoint) << shift
          shift += 6
      return rc

   * t[n] evaluates to an integer

   * the Text constructor takes a string or an integer

   * str(Text) evaluates to the NFC encoding of the Text object

   * Text.encode(...) works like str(Text).encode(...)

   * regular expressions work with Text objects

   * file system functions work with Text objects
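
A rough, runnable sketch of such a Text class might look like the following.
It is only an illustration under several assumptions of mine: a "real"
character is approximated as a base code point plus any trailing combining
marks (not the full Unicode grapheme cluster rules), each code point gets a
21-bit slot so that slots cannot overlap, and the helper names
split_real_chars, pack and unpack are made up. Regular expression and file
system support are left out.

```python
import unicodedata

SLOT_BITS = 21                       # assumption: wide enough for U+10FFFF
SLOT_MASK = (1 << SLOT_BITS) - 1


def split_real_chars(s):
    """Approximate "real" characters: a base plus trailing combining marks."""
    clusters = []
    for cp in unicodedata.normalize("NFC", s):
        if clusters and unicodedata.combining(cp):
            clusters[-1] += cp       # attach the mark to the previous base
        else:
            clusters.append(cp)      # start a new real character
    return clusters


def pack(cluster):
    """Pack the code points of one real character into a single integer."""
    rc = 0
    shift = 0
    for codepoint in cluster:
        rc |= ord(codepoint) << shift
        shift += SLOT_BITS
    return rc


def unpack(rc):
    """Inverse of pack(): recover the code points, lowest slot first."""
    chars = [chr(rc & SLOT_MASK)]
    rc >>= SLOT_BITS
    while rc:
        chars.append(chr(rc & SLOT_MASK))
        rc >>= SLOT_BITS
    return "".join(chars)


class Text:
    def __init__(self, value):
        if isinstance(value, int):   # a single real character given as rc
            self._chars = [value]
        else:
            self._chars = [pack(c) for c in split_real_chars(value)]

    def __len__(self):
        return len(self._chars)

    def __getitem__(self, n):
        return self._chars[n]        # list indexing, so O(1) by position

    def __str__(self):
        return "".join(unpack(rc) for rc in self._chars)

    def encode(self, *args, **kwargs):
        return str(self).encode(*args, **kwargs)


t = Text("a\u0303q\u0303z")          # ã (composes), q + tilde (does not), z
print(len(t))                        # 3
```

A real character that NFC can compose into one code point gets the same
integer as that code point, so t[0] here equals 0xE3.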


Instead of introducing Text, all of this could also be done within the
str class itself:

   * conceptually, an str object is a sequence of integers representing
     Unicode code points *or* "real" characters

   * ord(s) returns the code point or the integer (rc) from the
     algorithm above

   * chr(n) takes a valid code point or an rc value as defined above

   * s.canonical() returns a string that has merged all multi-code-point
     characters into single "real" characters
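
The str-based variant can be approximated with free functions, since the
builtin ord and chr cannot be changed from Python code. In this sketch
real_ord, real_chr and canonical are hypothetical names of mine, canonical
returns a list of integers rather than a str, and the 21-bit slot width and
the "base plus combining marks" rule are again assumptions. The point it
tries to show is that the extended functions agree with the builtins on
single code points.

```python
import unicodedata

SLOT_BITS = 21                       # assumption: one slot per code point
SLOT_MASK = (1 << SLOT_BITS) - 1


def real_ord(s):
    """Like ord(), but also accepts a base followed by combining marks."""
    rc = 0
    for i, cp in enumerate(unicodedata.normalize("NFC", s)):
        rc |= ord(cp) << (i * SLOT_BITS)
    return rc


def real_chr(n):
    """Like chr(), but also accepts a packed multi-code-point value."""
    chars = [chr(n & SLOT_MASK)]
    n >>= SLOT_BITS
    while n:
        chars.append(chr(n & SLOT_MASK))
        n >>= SLOT_BITS
    return "".join(chars)


def canonical(s):
    """Return one integer per "real" character of s."""
    units = []
    for cp in unicodedata.normalize("NFC", s):
        if units and unicodedata.combining(cp):
            units[-1] += cp          # merge the mark into the previous unit
        else:
            units.append(cp)
    return [real_ord(u) for u in units]


print(real_ord("a") == ord("a"))     # True
print(real_chr(0x61) == chr(0x61))   # True
print(len(canonical("q\u0303a\u0303")))  # 2
```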


Each approach has its upsides and downsides.


Marko


