[Tutor] Encode problem

Tue May 5 07:14:24 CEST 2009

"spir" <denis.spir at free.fr> wrote in message 
news:20090501220601.31891dfc at o...
> On Fri, 1 May 2009 15:19:29 -0300,
> "Pablo P. F. de Faria" <pablofaria at gmail.com> wrote:
>
>> self.cfg.write(codecs.open(self.properties_file,'w','utf-8'))
>>
>> As one can see, the character encoding is explicitly UTF-8. But
>> ConfigParser keeps trying to save it as an 'ascii' file and gives me
>> an error for directory names containing characters with code points
>> above 127 (like "Á"). This is a horrible thing for me, since my app
>> will be used mostly by Brazilians.
>
> Just superficial suggestions, only because it's the 1st of May and the
> weekend, so better answers may not come up before Monday.
>
> If everything you describe is right, then there must be something wrong
> with the character encoding in ConfigParser's write method. Have you had
> a look at it? I can hardly imagine why or how ConfigParser would limit
> file paths to 7-bit ASCII...
> Also, for Portuguese characters, you shouldn't even need explicit
> encoding; they should pass through silently because they fit in an 8-bit
> Latin charset. (I never encode French path/file names.)

The code below works.  ConfigParser isn't written to support Unicode
correctly.  I was able to get Unicode sections to write out, but it was just
luck.  Unicode keys and values break, as the OP discovered.  So treat
everything as byte strings:

----------------------------------------------------
# coding: utf-8
# Note: the coding line is required because of the non-ASCII
# characters in the source code.  It ONLY controls the
# encoding of the source file characters saved to disk.
import ConfigParser
import glob
import sys
c = ConfigParser.ConfigParser()
c.add_section('马克')  # a UTF-8 encoded byte string (note: no u'' prefix)
c.set('马克','多少','明白')  # so are these

# The following could be glob.glob(u'*.txt') to get filenames in
# Unicode, but this illustrates that the encoding of the source
# file has no bearing on the encoding of strings other than the
# ones hard-coded in the source file.  The 'files' list will hold byte
# strings in the default file system encoding, which for Windows
# is 'mbcs'...a magic value that changes depending on
# which country's version of Windows is running.
files = glob.glob('*.txt')
c.add_section('files')

for i,fn in enumerate(files):
    fn = fn.decode(sys.getfilesystemencoding())
    fn = fn.encode('utf-8')
    c.set('files','file%d'%(i+1),fn)

# Don't need a codec here...everything is already UTF8.
c.write(open('chinese.txt','wt'))
--------------------------------------------------------------
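
To make the decode/encode step in the loop above concrete, here is a small
sketch on a single made-up filename.  The helper name to_utf8 and the name
'\xc1gua.txt' are hypothetical, and 'cp1252' only stands in for whatever
sys.getfilesystemencoding() reports on a given machine.  It also shows the
same text failing under the 'ascii' codec, which is the kind of error the OP
ran into:

```python
def to_utf8(raw_name, fs_encoding):
    """Transcode a byte-string filename from fs_encoding to UTF-8."""
    return raw_name.decode(fs_encoding).encode('utf-8')

# Hypothetical filename: byte 0xC1 is A-acute in cp1252.
raw = b'\xc1gua.txt'
utf8_name = to_utf8(raw, 'cp1252')

# The same text cannot be encoded as ASCII, because A-acute
# has no representation in 7 bits.
try:
    raw.decode('cp1252').encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

After transcoding, the single byte 0xC1 becomes the two-byte UTF-8 sequence
0xC3 0x81, which is why the name has to be decoded with the right source
encoding first.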

Here is the content of my utf-8 file:

-----------------------------
[files]
file3 = ascii.txt
file2 = chinese.txt
file1 = blah.txt
file5 = ÀÈÌÒÙ.txt
file4 = other.txt

[马克]
多少 = 明白
----------------------------
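
For anyone on Python 3, where the module is named configparser and plain
string literals are already Unicode, a rough equivalent might look like the
sketch below.  This is not the script I ran above; the temporary directory
and the variable names are only there for illustration.  The point is to
name the encoding explicitly both when writing and when reading back, rather
than relying on the platform default:

```python
import configparser
import os
import tempfile

cfg = configparser.ConfigParser()
cfg.add_section('马克')
cfg.set('马克', '多少', '明白')
cfg.add_section('files')
cfg.set('files', 'file1', 'ÀÈÌÒÙ.txt')

# Write into a scratch directory so the sketch leaves the current one alone.
path = os.path.join(tempfile.mkdtemp(), 'chinese.txt')

# State the encoding explicitly; the platform default may not be UTF-8.
with open(path, 'w', encoding='utf-8') as fh:
    cfg.write(fh)

# Read the file back with the same encoding and look up a value.
check = configparser.ConfigParser()
check.read(path, encoding='utf-8')
value = check.get('马克', '多少')
```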

Hope this helps,
Mark