how to handle surrogate encoding: read from fs write to database

Sun Jun 12 08:09:31 EDT 2016

Hi, everybody.

What is the best practice for dealing with filenames in Python 3? The
problem is that os.walk(src_dir), os.listdir(src_dir), ... can return
"surrogate" strings as filenames, i.e. strings in which undecodable bytes
are escaped as lone surrogate characters. They cannot be treated as normal
strings that can be print()'ed on a Unicode terminal or saved as a string
into a database (MongoDB), because both raise UnicodeEncodeError on the
surrogate character. So, how should this situation be handled?
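
A minimal sketch of the problem, assuming a UTF-8 filesystem encoding.
raw_name is a made-up example name, and the decode call imitates what
PEP 383 describes rather than calling os.listdir() on a real directory:

```python
# Hypothetical raw filename: valid Latin-1 bytes, but not valid UTF-8.
raw_name = b"caf\xe9.txt"

# With the surrogateescape handler, each undecodable byte becomes a lone
# surrogate in the range U+DC80..U+DCFF (here 0xE9 -> U+DCE9).
name = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(name))  # 'caf\udce9.txt'

# A strict encode fails on the lone surrogate.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# Encoding with the same error handler restores the original bytes.
assert name.encode("utf-8", errors="surrogateescape") == raw_name
```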

The first solution I found was to convert filenames to bytes and use them.
But that's not nice: whenever I need to compare a filename with some
string, I have to convert the string to bytes first. Also, bytes objects
are shown base64-encoded in the mongo shell and are thus hard to read,
*e.g. "binary" : BinData(0,"c29tZSBiaW5hcnkgdGV4dA==")*. Finally, PEP 383
states that using bytes does not work on Windows (btw, why?).
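
For illustration, a small sketch of what the bytes approach looks like. The
file name report.txt and the temporary directory are made up for the
example; os.fsencode() is used here as one way to turn the str being
compared into bytes:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one example file so the listing is not empty.
    open(os.path.join(tmp, "report.txt"), "w").close()

    # Passing a bytes path makes os.listdir() return bytes names.
    names = os.listdir(os.fsencode(tmp))

    # Every str has to be converted before it can be compared.
    wanted = os.fsencode("report.txt")
    found = wanted in names

print(names, found)
```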

Another option I found is to work with filenames as surrogate strings but
convert them to a 'latin-1' string before printing or saving them into the
database:
    filename.encode(fse, errors='surrogateescape').decode('latin-1')
I like this way more, since Latin characters are clearly visible in the
mongo shell. Yet I doubt this is the best solution.
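
A sketch of that conversion together with its inverse, assuming fse stands
for sys.getfilesystemencoding(). The helper names to_display and
from_display are made up for the example:

```python
import sys

# Assumption: fse in the one-liner above is the filesystem encoding.
fse = sys.getfilesystemencoding()

def to_display(filename):
    """Turn a (possibly surrogate-escaped) filename into a plain string."""
    raw = filename.encode(fse, errors="surrogateescape")
    # Every byte value 0..255 maps to a character in latin-1,
    # so this decode step cannot fail.
    return raw.decode("latin-1")

def from_display(text):
    """Invert to_display(): recover the original filename string."""
    return text.encode("latin-1").decode(fse, errors="surrogateescape")

# Made-up example name containing one byte that is not valid UTF-8.
name = b"caf\xe9.txt".decode(fse, errors="surrogateescape")
shown = to_display(name)
assert from_display(shown) == name
```

One consequence of this choice, as far as I can tell: on a UTF-8 system a
perfectly valid non-ASCII name is also decoded byte by byte, so it is stored
as mojibake rather than as the characters the user sees.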

Ideally I would like to send surrogate strings to the database or to the
terminal as-is and let the db/terminal handle them. IOW, let the terminal
print garbage where surrogate characters appear. Is this possible in
Python?

So what do you think: is using unicode strings with an explicit conversion
to latin-1 a good option?

A related question: is it possible to detect surrogate characters in
strings? I found a suggestion to use re.compile('[\ud800-\uefff]+'). Yet
all this stuff feels too hacky to me, so I would like some confirmation
that this is the right way.
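
Two possible ways to do the detection, as a sketch only. The function names
are made up. Note that the surrogate block itself is U+D800..U+DFFF, so the
pattern below stops at \udfff; a range running to \uefff would also match
private-use characters in U+E000..U+EFFF:

```python
import re

# Surrogate block: U+D800..U+DFFF. The surrogateescape handler only
# produces U+DC80..U+DCFF, which lies inside this range.
SURROGATES = re.compile("[\ud800-\udfff]")

def has_surrogates(text):
    """Regex check: True if text contains any surrogate code point."""
    return SURROGATES.search(text) is not None

def has_surrogates_by_encoding(text):
    """Alternative check: a strict UTF-8 encode fails only on surrogates."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False
```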

Thanks in advance, and sorry for touching on this matter again. There have
been too many discussions, and it is not evident what the current state of
the art is.
--
Peter.