Everything you did not want to know about Unicode in Python 3

Steven D'Aprano steve at pearwood.info
Tue May 13 05:44:28 EDT 2014


On Tue, 13 May 2014 12:06:50 +0300, Marko Rauhamaa wrote:

> Chris Angelico <rosuav at gmail.com>:
> 
>> These are problems that Unicode can't solve.
> 
> I actually think the problem has little to do with Unicode. Text is an
> abstract data type just like any class. If I have an object (say, a
> subprocess or a dictionary) in memory, I don't expect the object to have
> any existence independently of the Python virtual machine. I have the
> same feeling about Py3 strings: they only exist inside the Python
> virtual machine.

And you would be correct. When you write them to a device (say, push them 
over a network, or write them to a file) they need to be serialized. If 
you're lucky, you have an API that takes a string and serializes it for 
you, and then all you have to deal with is:

- am I happy with the default encoding?

- if not, what encoding do I want?

Otherwise you ought to have an API that requires bytes, not strings, and 
you have to perform your own serialization by encoding the string yourself.
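
A minimal sketch of those two cases, assuming UTF-8 is the encoding you want. The sample text and the variable names are made up for illustration, and an in-memory buffer stands in for a real file or socket:

```python
import io

text = "na\u00efve caf\u00e9"  # 10 characters, two of them non-ASCII

# Case 1: the API takes a string and serializes it for you.
# All you decide is the encoding.
buf = io.BytesIO()
writer = io.TextIOWrapper(buf, encoding="utf-8")
writer.write(text)
writer.flush()
via_api = buf.getvalue()

# Case 2: the API wants bytes, so you do the serialization yourself.
by_hand = text.encode("utf-8")

# Either way the same bytes reach the device, and decoding with the
# same encoding gets the abstract string back.
same_bytes = via_api == by_hand
restored = by_hand.decode("utf-8")
```

Note that the string has 10 characters but serializes to 12 bytes under UTF-8; a different encoding would give a different byte sequence for the same text.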

But abstractions leak, and this abstraction leaks because *right now* 
there isn't a single serialization for text strings. There are HUNDREDS, 
and sometimes you don't know which one is being used.
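
To make the leak concrete, here is a small sketch with a made-up sample word: the same bytes give different text depending on which encoding the reader guesses, and some guesses fail outright.

```python
# The writer used UTF-8, but the bytes themselves don't say so.
data = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

as_utf8 = data.decode("utf-8")          # the intended text
as_latin1 = data.decode("latin-1")      # mojibake, but no error

# The other direction: bytes written as Latin-1, read as UTF-8.
legacy = "caf\u00e9".encode("latin-1")  # b'caf\xe9'
try:
    legacy.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The Latin-1 guess never raises, because every byte value maps to some character in Latin-1, so the wrong answer is silent. The UTF-8 guess at least fails loudly.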


[...]
> What I'm saying is that strings definitely have an important application
> in the human interface. However, I feel strings might be overused in the
> Py3 API. Case in point: are pathnames bytes objects or strings?

Yes. On POSIX systems, file names are sequences of bytes with very few 
restrictions: a name cannot contain a NUL byte or a slash. On recent 
Windows file systems (NTFS I believe?), file names are Unicode strings 
encoded to UTF-16, but with a whole lot of other restrictions imposed by 
the OS.


> The Linux position is that they are bytes objects. Py3 supports both
> interpretations seemingly throughout:
> 
>    open(b"/bin/ls")            vs    open("/bin/ls")
>    os.path.join(b"a", b"b")    vs    os.path.join("a", "b")

Because it has to, otherwise there will be files that are unreachable on 
one platform or another.
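
A rough sketch of how the two interpretations coexist, assuming a POSIX system (on Windows the last step may raise instead). The file name is invented for illustration and nothing is actually created on disk:

```python
import os

# os.path functions accept either type and return the same type.
joined_bytes = os.path.join(b"a", b"b")
joined_str = os.path.join("a", "b")

# Mixing the two in one call is rejected.
try:
    os.path.join("a", b"b")
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# On POSIX this is a legal file name, even though it is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# os.fsdecode turns it into a str using the file system encoding and
# error handler; os.fsencode reverses that, so the original bytes
# survive the trip and the file stays reachable through the str API.
as_text = os.fsdecode(raw_name)
round_trip = os.fsencode(as_text)
```

With a UTF-8 locale the undecodable byte is smuggled through the str as a lone surrogate, so `as_text` may not be printable or encodable to UTF-8 in the usual way, but it can still be handed back to `open()` or `os.stat()`.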


-- 
Steven


