Printing unicode to file

Martin von Loewis loewis at informatik.hu-berlin.de
Wed Oct 24 05:55:49 EDT 2001


Steven Cummings <stevenc at engineer.com> writes:

> If I encode('utf-8') on the way out to file, it may work, but it's
> hard to tell because for some reason it takes up to three minutes
> just to do the encode (?!?).

I find that unlikely; it must spend the time doing something else
(unless you have documents in the megabytes range).
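
One way to check where the time actually goes is to time the encode step on its own. The following is a minimal sketch in present-day Python 3, not the Python of this thread; the function name and the sample text are made up for illustration.

```python
import time

def time_utf8_encode(text):
    """Return (encoded bytes, elapsed seconds) for a single UTF-8 encode."""
    start = time.perf_counter()
    data = text.encode("utf-8")
    elapsed = time.perf_counter() - start
    return data, elapsed

# Hypothetical sample: 900,000 characters, some of them non-ASCII.
sample = "Gr\u00fc\u00dfe aus Berlin. " * 50000
data, elapsed = time_utf8_encode(sample)
print("encoded %d characters into %d bytes in %.4f s"
      % (len(sample), len(data), elapsed))
```

If the number printed here is small, the minutes are being spent elsewhere, for example in the COM calls that fetch the text.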

> And I'm not so sure if all of the characters will get translated
> correctly. What is the best way to go about this, altering text from
> MSWord through COM and generally writing it out to some text format?

If the output is HTML, you should make sure that you declare the
encoding properly inside the document, using a meta tag and perhaps
an XML declaration; see

http://www.w3.org/International/O-charset.html

If you declare the document correctly to be UTF-8, today's web
browsers should have no problems displaying it correctly.
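
As an illustration of keeping the declared and the actual encoding in sync, here is a minimal sketch in present-day Python 3. The function name, the file name, and the page skeleton are assumptions made for the example, not part of the original question.

```python
import html
import os
import tempfile

def write_utf8_html(path, title, body_text):
    """Write a small HTML page whose declared charset matches its encoding."""
    page = (
        '<html>\n'
        '<head>\n'
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        '<title>%s</title>\n'
        '</head>\n'
        '<body><p>%s</p></body>\n'
        '</html>\n'
    ) % (html.escape(title), html.escape(body_text))
    # The encoding argument makes the file object do the UTF-8 encode,
    # so the bytes on disk agree with the meta tag above.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(page)

# Hypothetical usage: write a page containing two non-ASCII characters.
path = os.path.join(tempfile.mkdtemp(), "out.html")
write_utf8_html(path, "Test", "na\u00efve caf\u00e9")
```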

Of course, if you don't want to risk any misunderstanding, you should
restrict yourself to ASCII, and use numeric character references
(&#decimalnum;) for characters above 127.

There is an ongoing debate on how to do this efficiently in Python.
My recommendation would be

try:
  # fast path: succeeds only if every character is ASCII
  result = unistring.encode("ascii")
except UnicodeError:
  # fall back to slow method if there are non-ascii characters
  result = [None]*len(unistring)
  i = 0
  for c in unistring:
    c = ord(c)
    if c < 128:
      result[i] = chr(c)
    else:
      # numeric character reference, e.g. &#233; for U+00E9
      result[i] = '&#%d;' % c
    i += 1
  result = ''.join(result)
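
Current Python versions provide an "xmlcharrefreplace" error handler for encode() that produces the same &#decimalnum; references in a single call. A minimal sketch in Python 3, where encode() returns bytes, so the result is decoded back to a string; the function name is made up for illustration:

```python
def to_ascii_html(text):
    """Return text as ASCII only, with other characters as &#N; references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_ascii_html("Gr\u00fc\u00dfe"))  # prints Gr&#252;&#223;e
```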

Of course, if unistring is the complete output, and there is a high
chance of non-ASCII characters in your document, the fast path will
nearly always fail and the code will fall back to the slow loop. In
that case, it may be reasonable to split the output into smaller
blocks and try to convert each of them to ASCII separately.
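
The block-wise idea could be sketched as follows in present-day Python 3. The helper names and the block size of 4096 characters are arbitrary assumptions for the example; a sensible size would have to be found by measurement.

```python
def char_refs(block):
    """Slow path: replace each non-ASCII character with &#decimalnum;."""
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in block)

def to_ascii_blockwise(text, block_size=4096):
    """Convert text block by block, using the slow path only where needed."""
    pieces = []
    for start in range(0, len(text), block_size):
        block = text[start:start + block_size]
        try:
            block.encode("ascii")  # fast check: raises if not pure ASCII
            pieces.append(block)
        except UnicodeError:
            pieces.append(char_refs(block))
    return "".join(pieces)

# Hypothetical input: one non-ASCII character near the end of a long string.
text = "a" * 5000 + "\u00e9" + "b" * 10
converted = to_ascii_blockwise(text)
```

Here only the second block takes the slow path; the first 4096 characters pass the ASCII check unchanged.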

Regards,
Martin


