python 2.7 and unicode (one more time)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Nov 22 08:50:29 EST 2014


Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> 
>> In Python, we have Unicode strings and byte strings.
> 
> No, you don't. You have strings and bytes:

Python has strings of Unicode code points, a.k.a. "Unicode strings",
or "text strings", and strings of bytes, a.k.a. "byte strings". These are
the plain English descriptive names of the types "str" and "bytes". 
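
A minimal sketch of the two types side by side, assuming Python 3 (in
Python 2.7 the corresponding types are spelled "unicode" and "str"). The
sample word is arbitrary and only there for illustration:

```python
# Assumes Python 3; the sample word is arbitrary.
text = "na\u00efve"            # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: a sequence of integers in range(256)

code_points = [ord(c) for c in text]
byte_values = list(data)

print(len(text), [hex(n) for n in code_points])
print(len(data), [hex(n) for n in byte_values])
```

The text string has five items and the byte string has six, because the
single code point U+00EF takes two bytes in UTF-8.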


>   Textual data in Python is handled with str objects, or strings.
>   Strings are immutable sequences of Unicode code points. String
>   literals are written in a variety of ways: [...]

Hence, Unicode string.


>   <URL: https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str>
> 
>   The core built-in types for manipulating binary data are bytes and
>   bytearray.

Which are strings of bytes.


>   <URL: https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>
> 
> 
> Equivalently, I wouldn't mind "character strings" vs "byte strings".

Unicode strings are not strings of characters, except informally. Some code
points are reserved as noncharacters:

http://www.unicode.org/faq/private_use.html#nonchar1
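
As a rough illustration, assuming Python 3 and the standard unicodedata
module, a string can hold noncharacter code points perfectly well. The two
code points chosen here are just examples:

```python
import unicodedata

# U+FFFF and U+FDD0 are two of the code points that Unicode
# designates as noncharacters.
nonchars = "\uffff\ufdd0"
s = "spam" + nonchars

# Each noncharacter still counts as one item of the string.
categories = [unicodedata.category(c) for c in nonchars]
print(len(s), categories)
```

Both report the general category "Cn" (not assigned), yet each is a valid
item in the string.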

They are strings of Unicode code points, but "code point string" is firstly
an inelegant and ugly phrase, and secondly ambiguous. What sort of code
points? Baudot codes? ASCII codes? Big5 codes? Tron codes? None of the
above: they are *Unicode* code points.

You haven't given any good reason for objecting to calling Unicode strings
by what they are. Maybe you think that it is an implementation detail, and
that some version of Python might suddenly and without warning change to
only supporting KOI8-R strings or GB2312 strings? If so, you are badly
mistaken. The fact that Python strings are Unicode is not an implementation
detail; it is part of the language semantics.
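
A small sketch of what that means in practice, assuming Python 3.3 or later
and the standard "koi8_r" and "gb2312" codecs. The letter is an arbitrary
choice; legacy encodings only come into it when converting to and from
bytes:

```python
import sys

zhe = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, an arbitrary example

# The same code point encodes to different bytes in different encodings,
# and decodes back to the same Unicode string each time.
encoded = {name: zhe.encode(name) for name in ("koi8_r", "gb2312", "utf-8")}
for name, raw in encoded.items():
    print(name, raw, raw.decode(name) == zhe)

# The range of a str item is the Unicode code space, not a codec's.
print(hex(sys.maxunicode))
try:
    chr(sys.maxunicode + 1)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
print(out_of_range_ok)
```

Whatever the encoding of the bytes, ord() of the decoded item is always the
Unicode code point, here 0x416.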


> Unicode strings is not wrong but the technical emphasis on Unicode is as
> strange as a "tire car" or "rectangular door" when "car" and "door" are
> what you usually mean.

"Tire car" makes no sense. "Rectangular door" makes perfect sense, and in a
world where there are dozens of legacy non-rectangular doors, it would be
very sensible to specify the kind of door. Just as we specify sliding door,
glass door, security door, fire door, flyscreen wire door, and so on.



-- 
Steven



