Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Wed Jul 19 08:56:49 EDT 2017


On Wed, Jul 19, 2017 at 10:13 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> On Wed, Jul 19, 2017 at 7:53 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>> Here's a proposal:
>>>
>>>    * introduce a builtin (predefined) class Text
>>>
>>>    * conceptually, a Text object is a sequence of "real" characters
>>>
>>>    * you can access each "real" character by its position in O(1)
>>>
>>>    * the "real" character is defined to be an integer computed as follows
>>>      (in pseudo-Python):
>>>
>>>       string = the NFC normal form of the real character as a string
>>>       rc = 0
>>>       shift = 0
>>>       for codepoint in string:
>>>           rc |= ord(codepoint) << shift
>>>           shift += 6
>>>       return rc
>>>
>>>     * t[n] evaluates to an integer
>>
>> A string could consist of 1 base character and N-1 combining
>> characters. Can you still access those combined characters in constant
>> time?
>
> Yes.

Perhaps we don't have the same understanding of "constant time". Or
are you saying that you actually store and represent this as those
arbitrary-precision integers? Every character of every string has to
be a multiprecision integer?
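
To make the concern concrete, here is a minimal sketch of the quoted
pseudo-Python as a runnable function. The name real_char_as_int and the
shift_step parameter are made up for illustration and are not part of the
proposal. It assumes the caller already hands in exactly one grapheme
cluster; it does no segmentation of its own.

```python
import unicodedata

def real_char_as_int(cluster: str, shift_step: int = 6) -> int:
    """Pack one grapheme cluster into an int, following the quoted pseudo-Python.

    Hypothetical helper: the name and shift_step are illustrative only.
    """
    rc = 0
    shift = 0
    # Normalize to NFC first, as the proposal specifies.
    for codepoint in unicodedata.normalize("NFC", cluster):
        rc |= ord(codepoint) << shift
        shift += shift_step
    return rc

# One base character followed by N-1 combining acute accents (U+0301).
for n in (1, 2, 10, 100):
    cluster = "x" + "\u0301" * (n - 1)
    value = real_char_as_int(cluster)
    print(n, value.bit_length())
```

The bit length of the result grows with the number of combining
characters, so each "character" is an integer of unbounded size. Indexing
into a sequence of such integers can still be O(1), but building,
comparing, or hashing a single element costs time proportional to the
length of its cluster.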

ChrisA


