Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Jun 5 01:56:36 EDT 2013


On Tue, 04 Jun 2013 10:23:33 -0700, Νικόλαος Κούρας wrote:

> What on eart is this damn error: Michael tried to explain to me about
> surrogates but dont think i understand it.
> 
> Encoding giving me trouble years now.
> 
> [Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59] Original
> exception was: [Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59]
> Traceback (most recent call last): [Tue Jun 04 20:19:53 2013] [error]
> [client 46.12.95.59]   File "files.py", line 72, in <module> [Tue Jun 04
> 20:19:53 2013] [error] [client 46.12.95.59]     cur.execute('''SELECT
> url FROM files WHERE url = %s''', (fullpath,) ) [Tue Jun 04 20:19:53
> 2013] [error] [client 46.12.95.59]   File
> "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py",
> line 108, in execute [Tue Jun 04 20:19:53 2013] [error] [client
> 46.12.95.59]     query = query.encode(charset) [Tue Jun 04 20:19:53
> 2013] [error] [client 46.12.95.59] UnicodeEncodeError: 'utf-8' codec
> can't encode character '\\udcd3' in position 61: surrogates not allowed
> 
> 
> 
> PLEASE TELL EM WHAT TO TRY, PLEASE FOR THE LOVE OF GOD, IAM SO
> FRUSTRATED NOT BEING ABLE TO DEAL WITH THIS.

Calm down. I know it is frustrating.

On a Linux system, the file system stores bytes, and only bytes. The file 
system does no validation of the bytes you give, except to check that 
there are no 0x00 and 0x2f bytes (ASCII '\0' and '/') in the file name. 
That's all.

So, if one program thinks that it should be sending file names in, say, 
UTF-16 or ISO-8859-7 encoding, it will take a string like "Νικόλαος" 
and the file system will see bytes like these:

py> s = 'Νικόλαος'
py> s.encode('UTF-16be')
b'\x03\x9d\x03\xb9\x03\xba\x03\xcc\x03\xbb\x03\xb1\x03\xbf\x03\xc2'

py> s.encode('iso-8859-7')
b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'


Notice that the same string gives you completely different bytes. And 
likewise, the same bytes will give you different strings, depending on 
the encoding you use.
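
To make the second half of that concrete, here is a small sketch that decodes the ISO-8859-7 bytes from above in two different ways. Latin-1 is just an arbitrary second encoding picked for illustration; any other single-byte encoding would make the same point.

```python
# The same eight bytes, interpreted under two different encodings.
raw = b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'

as_greek = raw.decode('iso-8859-7')   # the encoding the bytes were written in
as_latin1 = raw.decode('latin-1')     # a wrong guess: no error, just mojibake

print(as_greek)    # Νικόλαος
print(as_latin1)   # Íéêüëáïò
```

Note that the wrong guess does not raise an error. Whether you get an exception or silent garbage depends on the encoding you guess with.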


Now, if you try to read the file name using a program that expects UTF-8, 
it will either see some sort of mojibake garbage characters, or get some 
sort of error:

py> s.encode('UTF-16be').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 1: invalid start byte

py> s.encode('iso-8859-7').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 0: invalid continuation byte
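
The "surrogates not allowed" error in the quoted traceback is the other side of the same problem. When Python 3 reads file names from the operating system it decodes them with the 'surrogateescape' error handler, so a byte that is not valid UTF-8 does not raise an error straight away; it is carried along as a lone surrogate code point, and byte 0xd3 turns into '\udcd3'. The failure only shows up later, when something tries to encode that string as strict UTF-8. A minimal sketch, assuming the offending byte was an ISO-8859-7 'Σ' (that is a guess, since the real file name is not shown):

```python
# Assumption: the bad byte came from a file name written in ISO-8859-7.
raw = 'Σ'.encode('iso-8859-7')            # b'\xd3', not valid UTF-8 on its own

# This is how Python 3 decodes file names it gets from the OS.
name = raw.decode('utf-8', errors='surrogateescape')
print(ascii(name))                        # '\udcd3'

# Strict UTF-8 encoding, which is what the database driver attempted, fails.
try:
    name.encode('utf-8')
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)

# The original byte can still be recovered with the same error handler.
recovered = name.encode('utf-8', errors='surrogateescape')
```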


Somehow (I don't know how, because I didn't see it happen) you have one or 
more files in that directory whose names, as bytes, are not valid UTF-8, 
even though your system is set to use UTF-8. So to fix this you need to 
rename those files using some tool that doesn't care quite so much about 
encodings. Use the bash command line to rename each file in turn until 
the problem goes away.
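
If there are too many files to rename by hand, the same thing can be done from Python by working with bytes paths, which bypass file name decoding entirely. This is only a sketch and not the bash approach described above. The function names are made up for illustration, and ISO-8859-7 as the source encoding is an assumption; check the dry-run output before letting it rename anything.

```python
import os

def repaired_name(raw, assumed_encoding='iso-8859-7'):
    """Return raw unchanged if it is valid UTF-8, else re-encode it as UTF-8.

    assumed_encoding is a guess at what the name was written in. A byte
    that the assumed encoding does not define raises UnicodeDecodeError.
    """
    try:
        raw.decode('utf-8')
        return raw
    except UnicodeDecodeError:
        return raw.decode(assumed_encoding).encode('utf-8')

def rename_invalid_names(directory, assumed_encoding='iso-8859-7', dry_run=True):
    """List (old, new) name pairs for non-UTF-8 names; rename if not dry_run."""
    directory = os.fsencode(directory)
    changes = []
    for raw in os.listdir(directory):     # bytes argument gives bytes names
        fixed = repaired_name(raw, assumed_encoding)
        if fixed != raw:
            changes.append((raw, fixed))
            if not dry_run:
                os.rename(os.path.join(directory, raw),
                          os.path.join(directory, fixed))
    return changes
```

Calling rename_invalid_names('.') prints nothing and changes nothing; it returns the list of proposed renames so you can inspect them first, then call it again with dry_run=False.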



-- 
Steven


