how to write a unicode string to a file ?

Sat Oct 17 12:46:50 EDT 2009

"Mark Tolonen" <metolone+gmane at gmail.com> wrote in message 
news:hbbo6d$6ue$1 at ger.gmane.org...
>
> "Kee Nethery" <kee at kagi.com> wrote in message 
> news:AAAB63C6-6E44-4C07-B119-972D4F49E511 at kagi.com...
>>
>> On Oct 16, 2009, at 5:49 PM, Stephen Hansen wrote:
>>
>>> On Fri, Oct 16, 2009 at 5:07 PM, Stef Mientki  <stef.mientki at gmail.com> 
>>> wrote:
>>
>> snip
>>
>>> The thing is, I'd be VERY surprised (nay, shocked!) if Excel can't 
>>> open a file that is in UTF8-- it just might need to be TOLD that it's 
>>> utf8 when you go and open the file, as UTF8 looks just like ASCII --  
>>> until it contains characters that can't be expressed in ASCII. But I 
>>> don't know what type of file it is you're saving.
>>
>> We found that UTF-16 was required for Excel. It would not "do the right 
>> thing" when presented with UTF-8.
>
> Excel seems to expect a UTF-8-encoded BOM (byte order mark) to correctly 
> decide a file is written in UTF-8.  This worked for me:
>
> import codecs
>
> f=codecs.open('test.csv','wb','utf-8')
> f.write(u'\ufeff') # write a BOM (encoded as EF BB BF in UTF-8)
> f.write(u'马克,testing,123\r\n')
> f.close()
>
> When opened in Excel without the BOM (\ufeff), I got gibberish, but with 
> the BOM the Chinese characters were displayed correctly.
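
For anyone on Python 3, here is a minimal sketch of the same idea using the standard 'utf-8-sig' codec, which writes the UTF-8 BOM itself so it does not have to be added by hand. The helper name write_csv_with_bom and the temporary file path are made up for illustration and are not from the thread. Whether a particular Excel version then picks up the encoding is something to check on your own setup.

```python
import codecs
import os
import tempfile


def write_csv_with_bom(path, rows):
    """Write rows as comma-joined lines, UTF-8 encoded, with a leading BOM.

    Hypothetical helper for illustration only; it does no CSV quoting.
    """
    # 'utf-8-sig' emits the BOM (EF BB BF) once, at the first write.
    # newline="" disables newline translation so "\r\n" is written as-is.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        for row in rows:
            f.write(",".join(row) + "\r\n")


# Write the same sample row as above to a temporary file and read back the bytes.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
write_csv_with_bom(path, [["马克", "testing", "123"]])
with open(path, "rb") as f:
    raw = f.read()
```

The bytes in raw should start with codecs.BOM_UTF8, followed by the UTF-8 encoding of the row. For real data with embedded commas or quotes, the csv module would be a safer choice than joining fields by hand.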

Also, it turns out the Python 'utf-16' encoder adds a BOM for you, which is 
probably why UTF-16 worked for you and UTF-8 didn't:

>>> u'\u0102'.encode('utf-16-be') # explicit big-endian, no BOM
'\x01\x02'
>>> u'\u0102'.encode('utf-16-le') # explicit little-endian, no BOM
'\x02\x01'
>>> u'\u0102'.encode('utf-16') # machine native-endian, with BOM
'\xff\xfe\x02\x01'
>>> u'\u0102'.encode('utf-8') # no BOM
'\xc4\x82'
>>> u'\ufeff\u0102'.encode('utf-8') # explicit BOM
'\xef\xbb\xbf\xc4\x82'
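
To check which of these forms a given byte string is in, a rough Python 3 sketch along the following lines can help. The function name detect_bom is hypothetical, and it only knows about the UTF-8 and UTF-16 marks discussed here. In particular, a UTF-32 little-endian BOM also begins with FF FE, so this sketch would misreport it as UTF-16.

```python
import codecs


def detect_bom(data):
    """Return a codec name if data starts with a known BOM, else None.

    Hypothetical helper; covers only the UTF-8 and UTF-16 byte order marks.
    """
    candidates = (
        ("utf-8-sig", codecs.BOM_UTF8),      # EF BB BF
        ("utf-16-le", codecs.BOM_UTF16_LE),  # FF FE
        ("utf-16-be", codecs.BOM_UTF16_BE),  # FE FF
    )
    for name, bom in candidates:
        if data.startswith(bom):
            return name
    return None


# Explicit-endian encoders add no BOM, so nothing is detected.
no_bom = detect_bom("\u0102".encode("utf-16-be"))
# A manually prepended U+FEFF shows up as the UTF-8 BOM.
utf8_bom = detect_bom("\ufeff\u0102".encode("utf-8"))
```

Here no_bom is None and utf8_bom is 'utf-8-sig', matching the byte strings in the session above.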

-Mark