XML, Unicode

Tue Oct 1 08:14:56 EDT 2002

pcarey at lexmark.com writes:

> 1) Is it "safe" to use UTF-8 encoding for html pages. (These pages will be
> seen by lots of folks around the world.)

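Yes. Make sure to put a META tag that declares the encoding into the
HTML, so that browsers do not have to guess it. (If the web server
sends a charset in its Content-Type header, that one should say UTF-8
as well.)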
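
A minimal sketch of such a declaration, assuming the page is assembled
in Python. The build_page helper and the sample title and body are
made up for illustration and are not taken from the original pages:

```python
# Build a small HTML page whose META tag declares the UTF-8 encoding.
META = u'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

def build_page(title, body):
    """Return the page as Unicode text (hypothetical helper).

    title and body are assumed to be HTML-escaped already.
    """
    return (u"<html>\n<head>\n" + META + u"\n<title>" + title +
            u"</title>\n</head>\n<body>\n<p>" + body +
            u"</p>\n</body>\n</html>\n")

page = build_page(u"Gr\u00fc\u00dfe", u"\u4f60\u597d")

# Encode only at the boundary, when the bytes leave the program.
encoded = page.encode("utf-8")
```

The META tag sits at the top of the head, before the title, so that
the browser sees the declaration before any non-ASCII text.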

> 2) I use codecs.open("filename.html", "w+", "utf8") to create the html
> pages, encoded in utf-8; is this correct?

Yes. Make sure you don't write byte strings into the stream, only
Unicode objects. The only exception is pure-ASCII byte strings, such
as tag or attribute names and spacing.
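
As a rough sketch of that discipline (the file lives in a temporary
directory and the sample text is invented; mode "w" is used here
rather than "w+" since the file is only written):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "filename.html")

# codecs.open returns a stream that encodes Unicode text on write.
out = codecs.open(path, "w", "utf8")
try:
    out.write(u"<p>")                 # markup as Unicode
    out.write(u"caf\u00e9 \u20ac")    # non-ASCII text as Unicode
    out.write(u"</p>\n")
finally:
    out.close()

# A byte string that arrives from elsewhere is decoded first,
# using the encoding it is actually in (assumed UTF-8 here).
raw = b"na\xc3\xafve"
text = raw.decode("utf-8")

# Inspect the bytes that ended up on disk.
with open(path, "rb") as f:
    written = f.read()
```

Decoding explicitly at the point where bytes enter the program keeps
everything that reaches the stream a Unicode object.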

> 3) The xml is all utf-8, and it appears that, when building strings,
> non-utf8 strings are coerced?

You mean non-Unicode objects? Yes, adding a Unicode string and a byte
string gives a Unicode string; the byte string is converted using the
default encoding (ASCII), so it must not contain non-ASCII bytes.

[UTF-8 is a byte encoding; Unicode objects do not use UTF-8
internally, so saying that the "XML is utf-8" is a bit imprecise: it
was UTF-8 on disk, but isn't anymore when you process it]
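
The distinction can be made concrete with a short sketch. It decodes
explicitly instead of relying on the implicit ASCII conversion, which
later Python versions (3.x) dropped in favour of raising TypeError
when the two types are mixed. The sample bytes are invented:

```python
on_disk = b"<name>Jos\xc3\xa9</name>"   # UTF-8 bytes, as stored in a file
in_memory = on_disk.decode("utf-8")     # Unicode object, no longer UTF-8

# The e-acute takes two bytes in UTF-8 but is one character in Unicode.
byte_len = len(on_disk)
char_len = len(in_memory)

# Pure-ASCII bytes convert cleanly through the ASCII codec ...
tag = b"<b>".decode("ascii") + in_memory

# ... but bytes outside ASCII cannot be converted that way.
try:
    on_disk.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```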

> 4) Why do I hafta use '\r\n' for the "Newline character" instead of '\n'?

You don't have to; the resulting HTML file will work just fine with
\n only.

It is true that codecs.open opens the file in binary mode, so \n is
not transparently converted to \r\n. That could be considered a bug,
but it is difficult to fix: normally, you cannot rely on the C library
to correctly understand the notion of a newline if the file has a
non-native encoding. So nothing performs text-mode conversion here.
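
A small sketch that shows the effect, again using a temporary file and
made-up content. It also shows one way to get \r\n explicitly, should
that really be wanted:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newline.html")

out = codecs.open(path, "w", "utf8")
try:
    out.write(u"<p>one</p>\n<p>two</p>\n")
finally:
    out.close()

# Read the raw bytes back: the file was written in binary mode,
# so no \r has been inserted, whatever the platform.
with open(path, "rb") as f:
    raw = f.read()

# If \r\n line endings are wanted anyway, convert before writing.
crlf_text = u"<p>one</p>\n<p>two</p>\n".replace(u"\n", u"\r\n")
```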

> Again, everything seems to work great; I'm just a little gunshy
> about royally screwing up.

If you have tested your code with various funny characters (including
characters not supported in the "native" code page), I think you can
declare victory.
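
One possible shape for such a test is a simple round trip: write a few
strings from different scripts, read them back, and compare. The
sample strings and the file name are arbitrary choices:

```python
import codecs
import os
import tempfile

samples = [
    u"caf\u00e9",             # Latin-1 range
    u"\u0416\u0443\u043a",    # Cyrillic
    u"\u65e5\u672c\u8a9e",    # CJK
    u"\u20ac 100",            # euro sign
]

path = os.path.join(tempfile.mkdtemp(), "roundtrip.html")

# Write one sample per line, encoded as UTF-8.
out = codecs.open(path, "w", "utf8")
try:
    for s in samples:
        out.write(s + u"\n")
finally:
    out.close()

# Read everything back as Unicode and drop the trailing empty piece
# that split() leaves after the final newline.
inp = codecs.open(path, "r", "utf8")
try:
    read_back = inp.read().split(u"\n")[:-1]
finally:
    inp.close()

round_trip_ok = (read_back == samples)
```

This only checks the file layer; viewing the page in a browser remains
the real test of the META declaration.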

Regards,
Martin