Newbie question about text encoding

Sun Mar 8 04:01:19 EDT 2015

Steven D'Aprano wrote:

> Marko Rauhamaa wrote:
> 
>> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
>> 
>>> Marko Rauhamaa wrote:
>>>
>>>> That said, UTF-8 does suffer badly from its not being
>>>> a bijective mapping.
>>>
>>> Can you explain?
>> 
>> In Python terms, there are bytes objects b that don't satisfy:
>> 
>>    b.decode('utf-8').encode('utf-8') == b
> 
> Are you talking about the fact that not all byte streams are valid UTF-8?
> That is, some byte objects b may raise an exception on b.decode('utf-8').

Eh, I should have read the rest of the thread before replying...

> I don't see why that means UTF-8 "suffers badly" from this. Can you give
> an example of where you would expect to take an arbitrary byte-stream,
> decode it as UTF-8, and expect the results to be meaningful?

File names on Unix-like systems.

Unfortunately file names are a bit of a mess, but we're slowly converging on
Unicode support for files. I reckon that by 2070, 2080 tops, we'll have
that licked...

The three major operating systems have different levels of support for
Unicode file names:

* Apple OS X: HFS+ stores file names in decomposed form, using UTF-16. I
think this is the strictest Unicode support of all common file systems.
Well done Apple. Decomposed in this sense means that single code points may
be expanded where possible, e.g. é U+00E9 LATIN SMALL LETTER E WITH ACUTE
will be stored as two code points, U+0065 LATIN SMALL LETTER E + U+0301
COMBINING ACUTE ACCENT.

* Windows: NTFS stores file names as sequences of 16-bit code units except
0x0000. (Additional restrictions also apply: e.g. in POSIX mode, / is also
forbidden; in Win32 mode, / ? + etc. are forbidden.) The code units are
interpreted as UTF-16 but the file system doesn't prevent you from creating
file names with invalid sequences.

* Linux: ext2/ext3 stores file names as arbitrary bytes except for / and
nul. However most Linux distributions treat file names as if they were
UTF-8 (displaying ? glyphs for undecodable bytes), and many Linux GUI file
managers enforce the rule that file names are valid UTF-8.

File systems on removable media (FAT32, UDF, ISO-9660 with or without
extensions such as Joliet and Rock Ridge) have their own issues, but
generally speaking don't support Unicode well or at all.

So although the current situation is still a bit of a mess, there is a slow
move towards file names which are valid Unicode.

-- 
Steven