Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Sun Jul 16 01:53:39 EDT 2017


On Sun, Jul 16, 2017 at 2:25 PM, Rick Johnson
<rantingrickjohnson at gmail.com> wrote:
> But the two "realms" and two "character types" are but only a
> small sample of the syntactical complexity of Python
> strings. For we haven't even discussed the many types of
> string literals that Python defines. Some include:
>
>     (1) "Normal Strings"
>     (2) r"Raw Strings"
>     (3) b"Byte Strings"
>     (4) u"Unicode Strings"
>     (5) ru"Raw Unicode"
>     (6) ur'Unicode "that is _raw_"'
>     (7) f"Format literals"
>     ...
>
> Whew!

There are only two types of *string objects* in Python: Unicode
strings and byte strings. All the above are just ways of encoding
those in your source code. That's all. (And f-strings aren't really
strings, but expressions.)
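
A quick way to check this at the interactive prompt (a minimal sketch,
assuming Python 3.6 or later; the variable names are made up for
illustration):

```python
# Each prefix changes how the literal is written in source code,
# not the type of object it produces.
literals = [
    "Normal",       # plain literal     -> str
    r"Raw\n",       # raw literal       -> str (backslash kept as-is)
    u"Unicode",     # u prefix          -> str (same as no prefix)
    b"Bytes",       # bytes literal     -> bytes
    rb"Raw\bytes",  # raw bytes literal -> bytes
]
type_names = {type(x).__name__ for x in literals}
print(type_names)

# An f-string is an expression evaluated at run time; the result is
# an ordinary str.
name = "world"
greeting = f"hello {name}"
print(type(greeting).__name__, greeting)
```

However many literal forms are listed, the set of resulting types
contains only str and bytes.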

There is only one type of *integer object* in Python, yet there are
many forms of literal:

* decimal - 1234
* octal - 0o2322
* hexadecimal - 0x4d2
* binary - 0b10011010010
* the above, with separation - 1_234, 0b100_1101_0010, etc.
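
The same check works for the integer forms listed above (a small
sketch; the underscore separators assume Python 3.6 or later, and the
variable names are made up for illustration):

```python
# Every literal form below spells the same value and yields a plain int.
forms = [1234, 0o2322, 0x4d2, 0b10011010010, 1_234, 0b100_1101_0010]

all_int = all(type(n) is int for n in forms)
distinct_values = set(forms)

print(all_int)          # every element is an int
print(distinct_values)  # all forms collapse to a single value
```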

None of this has anything to do with the current discussion.
*ANYTHING*. Please do not introduce red herrings.

> Chris was arguing that zero width spaces should not be
> counted as characters when the `len()` function is applied
> to the string, for which i disagree on the basis of
> consistency. My first reaction is: "Why would you inject a
> char into a string -- even a zero-width char! -- and then
> expect that the char should not affect the length of the
> string as returned by `len`?"

Did you read my emails? I was never arguing that.

> Being that strings (on the highest level) are merely linear
> arrays of chars, such an assumption defies all logic.
> Furthermore, the length of a string (in chars) and the
> "perceived" length of a string (when rendered on a screen,
> or printed on paper), are in no way relevant to one another.

"chars" meaning what? We still don't have any definition of
"character" here. In Python, strings are arrays of code points.
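
A short sketch of what "arrays of code points" means for len() (the
sample strings are chosen here purely for illustration):

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one
# accented letter, but stored as two code points.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# U+200B ZERO WIDTH SPACE occupies no visible width, yet it is a code
# point like any other and is counted by len().
with_zwsp = "a\u200bb"

print(len(decomposed), len(composed), len(with_zwsp))
```

len() reports the number of code points in each case, regardless of
how many "characters" a reader would perceive on screen.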

> [1] Of course, even in the realms of ASCII, there are chars
> that cannot be inserted by the programmer _simply_ by
> pressing a single key on the keyboard. But most of these
> chars were useless anyways. So we will ignore this small
> detail for now. One point to mention is that Unicode
> greatly increased the number of useless chars.

Define "useless".

ChrisA


