A Unicode problem -HELP

"Martin v. Löwis" martin at v.loewis.de
Wed May 17 02:08:26 EDT 2006


manstey wrote:
> 			a=str(word_info + parse + gloss).encode('utf-8')
> 			a=a[1:len(a)-1]
> 
> Is this clearer?

Indeed. The problem is your use of str() to "render" the output.
Since word_info+parse+gloss is a list (or is it a tuple?), str()
returns its repr(), i.e. "Python source code": an ASCII byte string
that can be read back into the interpreter, with every non-ASCII
character replaced by a backslash escape. All Unicode is gone
from that string. If you want comma-separated output, you should
do this:

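def comma_separated_utf8(items):
    # items must be a sequence of unicode strings
    result = []
    for item in items:
        # encode each element on its own instead of str()-ing the sequence
        result.append(item.encode('utf-8'))
    return ", ".join(result)

and then

a = comma_separated_utf8(word_info + parse + gloss)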

Then you don't have to drop the parentheses from a anymore, as
it won't have parentheses in the first place.
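
To see why the slicing trick is not enough, here is a minimal sketch. The tuple is a hypothetical stand-in for word_info + parse + gloss, and it uses plain ASCII strings so that it behaves the same on any Python version. With unicode elements under Python 2, each element would additionally show up as u'...' with backslash escapes.

```python
# Hypothetical stand-in for word_info + parse + gloss
items = ("ab", "cd")

# str() of a tuple applies repr() to every element
rendered = str(items)

# The original slicing trick: drops the parentheses, but not the quotes
stripped = rendered[1:len(rendered) - 1]

# Joining the elements directly gives the plain comma-separated text
joined = ", ".join(items)

print(rendered)   # ('ab', 'cd')
print(stripped)   # 'ab', 'cd'
print(joined)     # ab, cd
```

Slicing removes only the outer parentheses; the quotes that repr() put around each element stay in the string, whereas join() never adds them.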

Since the output file already does the encoding when you write
to it, the following should also work:

a = u", ".join(word_info + parse + gloss)

This would make "a" a comma-separated unicode string, so that
the subsequent output_file.write(a) encodes it as UTF-8.
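
A rough sketch of that second approach follows. It assumes the output file is opened with an explicit encoding; io.open is my choice here (codecs.open behaves similarly), since the original post does not show how output_file was created. The function name, the sample data, and the temporary path are all made up for illustration.

```python
import io
import os
import tempfile


def write_comma_separated(path, items):
    """Join unicode items with ", " and let the file object encode them."""
    line = u", ".join(items)
    # A file opened with an encoding accepts unicode text and writes
    # the encoded bytes (UTF-8 here) to disk.
    with io.open(path, "w", encoding="utf-8") as output_file:
        output_file.write(line)
    return line


# Hypothetical sample data standing in for word_info + parse + gloss
sample = [u"L\u00f6wis", u"noun", u"name"]
path = os.path.join(tempfile.mkdtemp(), "out.txt")
write_comma_separated(path, sample)

# Reading the file back in binary mode shows the bytes that were written
with io.open(path, "rb") as check:
    data = check.read()
```

The point of this design is that the text stays unicode until the last moment, so no intermediate step needs to call encode() by hand.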

If that doesn't work, I would like to know the exact value
of gloss. Use

  print "GLOSS IS", repr(gloss)

to print it out.

Regards,
Martin


