A Unicode problem -HELP
"Martin v. Löwis"
martin at v.loewis.de
Wed May 17 02:08:26 EDT 2006
manstey wrote:
> a=str(word_info + parse + gloss).encode('utf-8')
> a=a[1:len(a)-1]
>
> Is this clearer?
Indeed. The problem is your use of str() to "render" the output.
As word_info+parse+gloss is a list (or is it a tuple?), str() will
already produce "Python source code", i.e. an ASCII byte string
that can be read back into the interpreter; every non-ASCII
character survives only as an escape sequence in that string.
If you want comma-separated output, you should do this:

    def comma_separated_utf8(items):
        result = []
        for item in items:
            result.append(item.encode('utf-8'))
        return ", ".join(result)

and then

    a = comma_separated_utf8(word_info + parse + gloss)

Then you don't have to drop the parentheses from a anymore, as
it won't have parentheses in the first place.
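
To make the difference concrete, here is a minimal sketch. It assumes Python 3 syntax, where ascii() behaves roughly like str() applied to a list of unicode strings did in Python 2 (minus the u prefixes); the sample items are made up for illustration.

```python
# Python 3 sketch; ascii() stands in for what Python 2's str() did to a list.
items = ["L\xf6wis", "caf\xe9"]    # hypothetical sample data

# "Python source code" rendering: brackets, quotes, and escape sequences.
rendered = ascii(items)

# Joining the items keeps the real characters; encode once at the end.
joined = ", ".join(items)
encoded = joined.encode("utf-8")

print(rendered)
print(encoded)
```

The rendered form contains only ASCII, so encoding it as UTF-8 afterwards cannot bring the original characters back, while the joined form encodes to proper UTF-8 bytes.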
As the output file will already do the encoding when you write
to it, the following should also work:

    a = u", ".join(word_info + parse + gloss)

This would make "a" a comma-separated unicode string, so that
the subsequent output_file.write(a) encodes it as UTF-8.
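
As a rough sketch of that second variant: the post does not show how output_file was opened, so the snippet below assumes a file opened with an explicit UTF-8 encoding via io.open (codecs.open was the usual route in older Pythons). The temporary path and the sample values are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical sample data standing in for the real lists.
word_info = [u"L\xf6wis"]
parse = [u"N", u"sg"]
gloss = [u"caf\xe9"]

# Stays a unicode string; no encoding happens here.
a = u", ".join(word_info + parse + gloss)

# Assumption: the output file does the encoding on write.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
output_file = io.open(path, "w", encoding="utf-8")
output_file.write(a)
output_file.close()

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()
```

Here raw holds the UTF-8 bytes of the comma-separated line, with no brackets or quotes to strip.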
If that doesn't work, I would like to know what the exact
value of gloss is. Use

    print "GLOSS IS", repr(gloss)

to print it out.
Regards,
Martin