[XML-SIG] Re: [4suite] Output encodings again

Uche Ogbuji uche.ogbuji@fourthought.com
Mon, 11 Sep 2000 02:26:31 -0600


Bet you thought we forgot, eh?

Carey Evans wrote:

> At least for XML output where we can expect some consistency among
> parsers, as few characters should be encoded as &#xxx; as possible.
> It should only be necessary to do this if the character doesn't exist
> in the output encoding.

Thanks again for all the bug-reports.  I've devoted several hours to
your comments alongside Tony Graham's Unicode book, the C code for the
Python wstrop module and the 4Suite code.

I think I've addressed the core bug you uncovered, but I need a bit of
help to go the rest of the way.

Currently, on output to XML (and HTML), we first convert the UTF-8 that
the DOM uses into Martin von Lowis's wchar type.  Then we use wstring to
convert from wchar to, say, ISO-8859-2.  The problem is that if there
are characters in the DOM that have no ISO-8859-2 representation, wstrop
only gives us two choices: throw an exception without any data about
where the offending character was, or skip it entirely.

So I'm rather at a loss as to how to efficiently escape such characters
for XML output.  I know I want to render them as &#???;, but every
method I see for doing so is rather wasteful.
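
For concreteness, here is a minimal sketch of the output I am after. It is
not something wstrop can do: it assumes a Python with a built-in Unicode
string type and the standard "xmlcharrefreplace" codec error handler, and
the function name escape_for_xml is made up for illustration.

```python
def escape_for_xml(text, encoding):
    """Encode text for XML output in the given encoding.

    Characters that exist in the target encoding are written as-is;
    anything else becomes a decimal character reference (&#NNN;).
    Assumption: the codec error handler "xmlcharrefreplace" is available.
    """
    return text.encode(encoding, "xmlcharrefreplace")


# U+0142 (l with stroke) exists in ISO-8859-2; U+20AC (euro sign) does not.
sample = "z\u0142oty \u20ac"
encoded = escape_for_xml(sample, "iso-8859-2")
```

Only the euro sign gets escaped here; the l-with-stroke goes out as a single
ISO-8859-2 byte, which matches Carey's "as few characters as possible" rule.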

Of course with Python 2.0 we can just use Python's unicode type for the
DOMString so hopefully this problem would tend to go away, no?

Does anyone on xml-sig catch my drift?  Am I missing some capabilities
of wstrop?  Should I just hang on a few more months for Python 2.0?


> For HTML it may be necessary to escape some characters, given web
> browsers' tendency to try to guess encodings.  I guess this is what
> the text at the bottom of TextWriter.py is talking about.

I think I've made the writers much smarter for HTML.  4Suite will now
respect most encodings, apart from the problem I mention above, and turn
characters from U+00A0 (no-break space) to U+00FF (ÿ) into the equivalent
HTML entities (e.g. U+00A0 -> &nbsp;).
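
Roughly, the HTML mapping looks like the sketch below. This is not the
actual 4Suite writer code: the function name latin1_to_entities is
hypothetical, and the entity table is borrowed from the standard
html.entities module of later Python releases purely for illustration.

```python
from html.entities import codepoint2name


def latin1_to_entities(text):
    """Replace characters in the range U+00A0..U+00FF with named HTML
    entities, leaving everything else untouched."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xA0 <= cp <= 0xFF:
            name = codepoint2name.get(cp)
            # Fall back to a numeric reference if no name is defined.
            out.append("&%s;" % name if name else "&#%d;" % cp)
        else:
            out.append(ch)
    return "".join(out)
```

Markup characters such as < and & are a separate escaping step and are not
handled here.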

-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python