[Tutor] Unicode Encode Error

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Thu Apr 27 23:12:40 CEST 2006


>>> You're right, I realised after playing with Tim's example that the 
>>> problem was that I wasn't calling close() on the codecs file. Adding 
>>> this after the f.write(html_text) seems to flush the buffer which 
>>> means that the content now gets written to the file.
>>
>> Quick note: it may be important to write and read from the file using 
>> binary mode "b".  It's not so significant under Unix, but it is more 
>> significant under Windows, because otherwise we may get some weird 
>> results.
>
> But the file is utf-8 text, ISTM it should be written as text, not 
> binary. Why do you recommend binary mode?

Hi Kent,

Oh!  I just wrote that out because I had a vague and fuzzy feeling that 
utf-8, having high-order binary bits, needed to be written carefully. 
But let me examine that unexamined assumption...

No, you're right, we don't have to be so careful here, for carriage 
returns and newlines have their standard interpretation under utf-8 too. 
Ok, good to know.  Thank you!
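
For the record, a minimal sketch of writing utf-8 as plain text and reading it back.  It uses Python 3 syntax and the built-in open() with an encoding argument, which is newer than the codecs file discussed in this thread, so treat that substitution as my assumption; the file name and sample text are made up.  The with-block closes the file on exit, which is what flushes the buffer:

```python
import os
import tempfile

# Sample text containing one non-ASCII character (e with acute accent).
html_text = "<p>caf\u00e9</p>\n"

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "page.html")

# Text mode with an explicit encoding; leaving the with-block closes
# the file, so the buffered content is written out.
with open(path, "w", encoding="utf-8") as f:
    f.write(html_text)

with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

assert read_back == html_text
```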


I've seen so many problems with Windows and binary data that I use 'rb' out 
of habit whenever I deal with data that has high-order bits set.  For example, 
chr(26) (Ctrl-Z) causes Windows to stop reading a file prematurely in 
text mode:

     http://mail.python.org/pipermail/python-list/2003-March/154659.html
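
A minimal sketch of the habit I mean, in Python 3 syntax (newer than this thread); the file name and sample bytes are made up for illustration.  Reading in binary mode hands back exactly the bytes that were written, byte 26 included.  Whether a text-mode read would stop early depends on the platform and the Python version, so the sketch does not try to show that:

```python
import os
import tempfile

# Byte 26 is Ctrl-Z, the character discussed in the linked post.
data = b"before" + bytes([26]) + b"after"

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "ctrl_z_demo.bin")

with open(path, "wb") as f:
    f.write(data)

# Binary mode performs no translation, so every byte comes back as written.
with open(path, "rb") as f:
    round_tripped = f.read()

assert round_tripped == data
```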

On a close reading of the utf-8 encoding standard, though, I see that it 
encodes every code point beyond ASCII using only bytes with the high bit 
set, never control characters, so my caution is unfounded.
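
To make that concrete, here is a small sketch (Python 3 syntax, which postdates this thread; the helper name and the sample characters are my own).  It checks that the utf-8 encoding of a non-ASCII character contains no bytes in the ASCII range, so it cannot contain byte 26, a carriage return, or a newline:

```python
def utf8_bytes_below_0x80(ch):
    """Return the bytes of ch's utf-8 encoding that fall in the ASCII range."""
    return [b for b in ch.encode("utf-8") if b < 0x80]

# Non-ASCII samples with 2-byte, 3-byte, and 4-byte encodings.
samples = ["\u00e9", "\u20ac", "\U0001d11e"]

for ch in samples:
    # Every byte of a multi-byte sequence is 0x80 or higher.
    assert utf8_bytes_below_0x80(ch) == []
    print(repr(ch), [hex(b) for b in ch.encode("utf-8")])

# ASCII characters, control characters included, encode as themselves.
assert chr(26).encode("utf-8") == b"\x1a"
assert "\r\n".encode("utf-8") == b"\r\n"
```

Note the last two lines: a literal chr(26) in the text still comes out as byte 26.  The guarantee only covers the bytes used for characters outside ASCII.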

