[I18n-sig] Mixed encodings and XML

Andy Robinson andy@reportlab.com
Thu, 14 Dec 2000 16:16:03 -0000


> 1.  Is there any way to convince an XML parser to work with source
> with mixed encoding.  The exchange with you has helped disabuse me
> of any silly notion that this might be so.  So I shall have to use
> XInclude.
>
> 2.  Will the results of the rendering be such that the LATIN-1 parts
> can be read normally and the portions with other encodings would be
> available for cut and paste?  If I use XInclude, no reason why not.

I did exactly this in an internal help page for a company that was
learning this stuff a year ago.  I don't see a problem, because most
CJKV encodings are byte-oriented (8-bit units) and ASCII-compatible.
Declare the document as Latin-1, because that way your parser will not
choke on or corrupt bytes above 127.  Then paste in text in whatever
encoding you want.  Any Kanji text in one of the common
ASCII-compatible encodings (Shift-JIS, EUC, or even UTF-8) will appear
as gobbledegook, but the underlying bytes will not be corrupted, so
readers should be able to paste them out.  You should be able to
transform the whole document from ISO Latin-1 to UTF-8 and back
without loss of data; do a quick test from Python to verify it.
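
A rough sketch of that quick test, written for a current Python 3
rather than the Python of the day.  The sample string, the <p> wrapper
and the function name are my own assumptions for illustration; the
idea is only that Latin-1 maps every byte value to exactly one
character, so the trip through UTF-8 should return the original bytes.

```python
def survives_latin1_roundtrip(raw: bytes) -> bool:
    """Treat raw bytes as Latin-1, convert to UTF-8 and back, and compare."""
    as_latin1 = raw.decode("latin-1")      # every byte 0..255 maps to one character
    as_utf8 = as_latin1.encode("utf-8")    # the "transformed" document
    back = as_utf8.decode("utf-8").encode("latin-1")
    return back == raw


# Hypothetical sample text, embedded in ASCII markup as in a help page.
sample = "日本語のテキスト"

for codec in ("shift_jis", "euc_jp", "utf-8"):
    raw = b"<p>Kanji: " + sample.encode(codec) + b"</p>"
    print(codec, survives_latin1_roundtrip(raw))
```

Each line should report True if the bytes survive intact.  This only
checks the byte-level round trip, not how any particular XML parser or
browser treats the document.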

Not exactly an industrial solution, but it's not exactly an industrial
problem.

It would of course go horribly wrong if you used exotic encodings like
UTF-16 with null bytes :-)
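
To see why, here is a small check, again only a sketch with a made-up
sample string and helper name.  It looks for zero bytes in the encoded
form of some markup; in UTF-16 every ASCII character carries one, and
a NUL character is not legal in XML 1.0, so a parser told the document
is Latin-1 has no way to accept it.

```python
def has_null_bytes(text: str, codec: str) -> bool:
    """Return True if encoding `text` with `codec` produces any zero byte."""
    return b"\x00" in text.encode(codec)


# Hypothetical sample: ASCII markup around a few Kanji.
markup = "<p>日本語</p>"

for codec in ("shift_jis", "euc_jp", "utf-8", "utf-16-le"):
    print(codec, has_null_bytes(markup, codec))
```

Only the UTF-16 line should come back True.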

- Andy Robinson