Changing filenames from Greeklish => Greek (subprocess complain)

Chris Angelico rosuav at gmail.com
Mon Jun 3 03:36:39 EDT 2013


On Mon, Jun 3, 2013 at 4:46 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> Then, when
> you try to read the file names in UTF-8, you hit an illegal byte, half of
> a surrogate pair perhaps, and everything blows up.

Minor quibble: Surrogates are an artifact of UTF-16, so they're 16-bit
values like 0xD808 or 0xDF45. Possibly what you're talking about here
is a continuation byte, which in UTF-8 is valid only as part of a
sequence started by a lead byte. For instance: 0xF0 0x92 0x8D 0x85 is
valid UTF-8 (it encodes U+12345), but 0x41 0x92 is not, because 0x92 is
a continuation byte with no lead byte before it.
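
A quick sketch of that distinction, for anyone who wants to poke at it.
The helper names (is_valid_utf8, is_continuation_byte) are made up for
illustration; the only real API used is bytes.decode and str.encode:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def is_continuation_byte(b: int) -> bool:
    """Continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF)."""
    return b & 0xC0 == 0x80


# Lead byte 0xF0 followed by three continuation bytes.
valid = bytes([0xF0, 0x92, 0x8D, 0x85])
# ASCII 'A' followed by a continuation byte that has no lead byte.
stray = bytes([0x41, 0x92])

print(is_valid_utf8(valid), hex(ord(valid.decode('utf-8'))))
print(is_valid_utf8(stray), is_continuation_byte(stray[1]))

# The same character in UTF-16 (big-endian) is where the surrogates
# 0xD808 and 0xDF45 actually show up.
utf16 = valid.decode('utf-8').encode('utf-16-be')
print(utf16.hex())
```

The last line is the point of the quibble: surrogates appear in the
UTF-16 form of the character, never in its UTF-8 form.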

There is one other really annoying thing to deal with, and that's the
theoretical UTF-8 encoding of a UTF-16 surrogate. (I say "theoretical"
because strictly, these are invalid; UTF-8 is not allowed to encode
surrogate codepoints.) 0xED 0xA0 0x88 and 0xED 0xBD 0x85 encode the two
I mentioned above. Depending on what's reading the filename, these
might throw errors, or they might be silently accepted. Python's
decoder is correctly strict:

>>> str(b'\xed\xa0\x88','utf-8')
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    str(b'\xed\xa0\x88','utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

Actually, I'm not sure here, but I think that error message may be
wrong, or at least unclear. It's perfectly possible to decode those
bytes using the UTF-8 algorithm; you end up with the value 0xD808,
which you then reject because it's a surrogate. But maybe the Python
UTF-8 decoder simplifies some of that.
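
To make that concrete, here is a rough sketch of the bare three-byte
UTF-8 arithmetic with the surrogate check done as a separate step. The
functions decode_three_byte and is_surrogate are hypothetical helpers
written for this post, not how CPython's decoder is actually
structured (I haven't checked its internals):

```python
def decode_three_byte(seq: bytes) -> int:
    """Apply the 3-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx.

    Only the bit patterns are checked; the resulting value is not
    validated in any other way.
    """
    b0, b1, b2 = seq
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a 3-byte UTF-8 pattern")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(codepoint: int) -> bool:
    """Surrogates occupy the range U+D800..U+DFFF."""
    return 0xD800 <= codepoint <= 0xDFFF


for raw in (b'\xed\xa0\x88', b'\xed\xbd\x85'):
    value = decode_three_byte(raw)
    print(hex(value), is_surrogate(value))

# Python can be told to let surrogates through with the
# 'surrogatepass' error handler.
lenient = b'\xed\xa0\x88'.decode('utf-8', 'surrogatepass')
print(hex(ord(lenient)))
```

So the bit-shuffling itself works fine on those bytes; it's the range
check afterwards that rejects them, which is why "invalid continuation
byte" reads a little oddly as the reason.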

ChrisA


