Problem writing some strings (UnicodeEncodeError)

Sun Jan 12 13:50:16 EST 2014

Paulo da Silva wrote:

> Em 12-01-2014 16:23, Peter Otten escreveu:
>> Paulo da Silva wrote:
>> 
>>> I am using a python3 script to produce a bash script from lots of
>>> filenames got using os.walk.
>>>
>>> I have a template string for each bash command in which I replace a
>>> special string with the filename and then write the command to the bash
>>> script file.
>>>
>>> Something like this:
>>>
>>> shf=open(bashfilename,'w')
>>> filenames=getfilenames() # uses os.walk
>>> for fn in filenames:
>>>     ...
>>>     cmd=templ.replace("<fn>",fn)
>>>     shf.write(cmd)
>>>
>>> For certain filenames I got a UnicodeEncodeError exception at
>>> shf.write(cmd)!
>>> I use utf-8 and have # -*- coding: utf-8 -*- in the source .py.
>>>
>>> How can I fix this?
>>>
>>> Thanks for any help/comments.
>> 
>> You make it harder to debug your problem by not giving the complete
>> traceback. If the error message contains 'surrogates not allowed' like in
>> the demo below
>> 
>> >>> with open("tmp.txt", "w") as f:
>> ...     f.write("\udcef")
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in
>> position 0: surrogates not allowed
> 
> That is the situation. I just lost it and it would take a few hours to
> reproduce it. Sorry.
> 
> 
>> 
>> you have filenames that are not valid UTF-8 on your harddisk.
>> 
>> A possible fix would be to use bytes instead of str. For that you need to
>> open `bashfilename` in binary mode ("wb") and pass bytes to the os.walk()
>> call.
> This is my 1st time with python3, so I am confused!
> 
> As far as I can understand, it seems that os.walk is returning the
> filenames exactly as they are on disk. Just bytes, like in C.

No, they are decoded with the filesystem encoding (see
sys.getfilesystemencoding()). With UTF-8 that can fail, and if it does the
surrogateescape error handler replaces each offending byte with a special
code point, a lone surrogate in the range U+DC80 to U+DCFF:

>>> import os
>>> with open(b"\xe4\xf6\xfc", "w") as f: f.write("whatever")
... 
8
>>> os.listdir()
['\udce4\udcf6\udcfc']
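
To illustrate the mapping without touching the disk, here is a minimal
sketch that calls bytes.decode() directly on the same three bytes (the
variable names are only for illustration). Each undecodable byte 0xNN
becomes the lone surrogate U+DCNN, and encoding with the same error handler
restores the original bytes:

```python
raw = b"\xe4\xf6\xfc"  # Latin-1 bytes for "äöü"; not valid UTF-8

# surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the bytes.
restored = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))
print(restored == raw)
```

On POSIX systems os.fsdecode() and os.fsencode() apply the same handler
with the filesystem encoding, so os.fsencode(name) should give you the
on-disk bytes for a str that came from os.listdir() or os.walk().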

You can bypass the decoding process by providing a bytes argument to 
os.listdir() (or os.walk() which uses os.listdir() internally):

>>> os.listdir(b".")
[b'\xe4\xf6\xfc']

To write these raw bytes to a file, that file must of course be opened in
binary mode, too.
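
Put together, the bytes-only route could look roughly like this. It is an
untested-on-your-data sketch: write_script(), its parameters and the
b"ls <fn>\n" template are made up for the example, and the names are
written unquoted, so it is only safe for names the shell will not
misinterpret.

```python
import os


def write_script(top, script_path, template=b"ls <fn>\n"):
    """Write one command per file under `top`, keeping names as raw bytes."""
    # Binary mode: bytes go to the file unchanged, no encoding step involved.
    with open(script_path, "wb") as shf:
        # A bytes argument makes os.walk() yield bytes paths, undecoded.
        for dirpath, dirnames, filenames in os.walk(os.fsencode(top)):
            for fn in filenames:
                path = os.path.join(dirpath, fn)
                # The template must be bytes too; str and bytes do not mix.
                shf.write(template.replace(b"<fn>", path))
```

The point to notice is that everything that touches a filename is bytes:
the argument to os.walk(), the template, the placeholder, and the file.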

> My template is a string. What is the result of the replace command? Is
> there any change in the filename from os.walk contents?
> 
> Now, if the result of the replace has the replaced filename unchanged
> how do I "convert" it to bytes type, without changing its contents, so
> that I can write to the bashfile opened with "wb"?
> 
> 
>> 
>> Or you just go and fix the offending names.
> This is impossible in my case.
> I need a bash script with the names as they are on disk.

I think that instead of the hard way sketched out above it will be
sufficient to specify the error handler when opening the destination file:

shf = open(bashfilename, 'w', errors="surrogateescape")

I have not tried it myself, though. Also, some characters may need to be
escaped, either to be understood by the shell, or to address security
concerns:

>>> import os
>>> template = "ls <fn>"
>>> for filename in os.listdir():
...     print(template.replace("<fn>", filename))
... 
ls foo; rm bar
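
One way to combine both ideas is to quote each name with shlex.quote()
(available since Python 3.3) and to write the result with the
surrogateescape handler. This is a sketch under assumptions:
render_commands() and write_script() are hypothetical helpers, and the
output file is assumed to be UTF-8.

```python
import shlex


def render_commands(filenames, template="ls <fn>\n"):
    """Return shell commands with each filename quoted for the shell."""
    return [template.replace("<fn>", shlex.quote(fn)) for fn in filenames]


def write_script(path, commands):
    """Write the commands, restoring undecodable filename bytes."""
    # surrogateescape turns U+DCxx code points back into the original bytes;
    # newline="\n" keeps Unix line endings regardless of platform.
    with open(path, "w", encoding="utf-8", errors="surrogateescape",
              newline="\n") as shf:
        shf.writelines(commands)


for cmd in render_commands(["foo; rm bar"]):
    print(cmd, end="")
```

With the quoting in place the name from the demo above stays a single
argument to ls instead of starting a second command.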