[XML-SIG] Re: [I18n-sig] Mixed encodings and XML
Tom Emerson
tree@basistech.com
Wed, 13 Dec 2000 18:09:47 -0500
Uche Ogbuji writes:
> It contains many sections within HTML PREs with the different encodings
> I mentioned. They look like
>
> <PRE LANG="zh-TW">
> ... BIG5-encoded stuff ...
> </PRE>
The LANG attribute does not specify an encoding; it specifies a
language. You cannot safely infer anything about the encoding from
the value of the LANG attribute. For example, "zh-TW" text could be
encoded in Big5, Big5+, GBK, CP950, CP936, EUC-CN (depending on the
text), ISO-2022-CN, ISO-2022-CN-EXT, among others.
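To make this concrete, here is a small sketch (using modern Python's
codec names, an anachronism for this thread) showing that one and the
same "zh-TW" string yields different byte sequences under different
encodings:

```python
# The same Traditional Chinese text produces different byte sequences
# depending on the codec, so a language tag like "zh-TW" alone cannot
# tell you how the bytes are encoded.
text = "中文"  # "Chinese (written language)"

in_big5 = text.encode("big5")
in_gbk = text.encode("gbk")
in_utf8 = text.encode("utf-8")

# Three distinct byte sequences for one language-tagged string.
assert len({in_big5, in_gbk, in_utf8}) == 3
```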
The LANG attribute can be used by an application to help select the
appropriate glyph variants, though I don't know of any offhand that
do this.
> I need to convert the document to XML Docbook format. My naive attempts
> at converting to
>
> <screen xml:lang="zh-TW">
> ... BIG5-encoded stuff ...
> </screen>
>
> Of course don't work because the parser takes one look at the BIG5 and
> throws a well-formedness error.
Which it is required to do; see Section 4.3.3 of the XML 1.0
specification.
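As an illustration (a sketch with the modern Python standard library,
not any tool from this thread): raw Big5 bytes dropped into a document
with no encoding declaration are read as UTF-8, and the parse fails
exactly as described:

```python
import xml.etree.ElementTree as ET

# Raw Big5 bytes embedded in an otherwise-ASCII document; with no
# encoding declaration the parser assumes UTF-8, and 0xA4 is not a
# valid UTF-8 start byte, so parsing must fail.
big5_fragment = "中文".encode("big5")
doc = b"<screen>" + big5_fragment + b"</screen>"

try:
    ET.fromstring(doc)
    well_formed = True
except ET.ParseError:
    well_formed = False

# well_formed is False: the document is rejected as not well-formed.
```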
> Is there any way to manage this besides using XInclude? Do any of the
> Python parsers have any tricks that could help?
Convert all of those sections into Unicode, using UTF-8 as the
encoding form. You could write a trivial Python script to do this for
you.
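A sketch of such a script in modern Python (the helper name and the
sample section contents are assumptions; in practice you would map
each PRE section's language tag to its actual source encoding):

```python
# Hypothetical helper: re-encode bytes from a legacy CJK encoding
# into UTF-8, which any XML parser is required to accept.
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    return data.decode(source_encoding).encode("utf-8")

# e.g. the contents of a <PRE LANG="zh-TW"> section, Big5-encoded.
big5_section = "中文".encode("big5")
utf8_section = to_utf8(big5_section, "big5")

# Same characters, now in a form the XML parser will accept.
assert utf8_section.decode("utf-8") == "中文"
```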
The bigger problem (IMHO) will be convincing your DocBook toolchain
to handle the Asian characters. If you find a good solution to that
(i.e., one allowing Simplified and Traditional Chinese, Korean, and
(say) Thai in a single document), let me know.
-tree
--
Tom Emerson Basis Technology Corp.
Zenkaku Language Hacker http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"