[I18n-sig] Mixed encodings and XML

Tom Emerson tree@basistech.com
Wed, 13 Dec 2000 18:09:47 -0500


Uche Ogbuji writes:
> It contains many sections within HTML PREs with the different encodings
> I mentioned.  They look like
> 
> <PRE LANG="zh-TW">
> ... BIG5-encoded stuff ...
> </PRE>

The LANG attribute does not specify an encoding; it specifies a
language. You cannot safely infer anything about the encoding from
the value of the LANG attribute. For example, "zh-TW" text could be
encoded in Big5, Big5+, GBK, CP950, CP936, EUC-CN (depending on the
text), ISO-2022-CN, ISO-2022-CN-EXT, and others.
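
To make that concrete, here is a sketch in Python (purely
illustrative, and it assumes an interpreter with the CJK codecs
available): the same two characters come out as entirely different
byte sequences under different encodings, so a language tag alone
tells you nothing about the bytes.

    # Sketch: one zh-TW string under three encodings.
    s = "\u4e2d\u6587"                  # U+4E2D U+6587
    for name in ("big5", "gbk", "utf-8"):
        print(name, s.encode(name))    # three different byte sequences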

The LANG attribute can be used by an application to help select the
appropriate glyph variants, though I don't know of any offhand that
do this.

> I need to convert the document to XML Docbook format.  My naive attempts
> at converting to 
> 
> <screen xml:lang="zh-TW">
> ... BIG5-encoded stuff ...
> </screen>
>
> Of course don't work because the parser takes one look at the BIG5 and
> throws a well-formedness error.

Which it is required to do; see Section 4.3.3 of the XML specification.
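
The failure is easy to reproduce; here is a sketch in Python (the
element name and text are invented for illustration): raw Big5 bytes
in a document the parser reads as UTF-8 trigger the mandatory
well-formedness error.

    # Sketch: Big5 bytes are not valid UTF-8, so parsing must fail.
    import xml.etree.ElementTree as ET

    big5 = "\u4e2d\u6587".encode("big5")
    doc = b'<screen xml:lang="zh-TW">' + big5 + b'</screen>'
    try:
        ET.fromstring(doc)
    except ET.ParseError as exc:
        print("well-formedness error:", exc)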

> Is there any way to manage this besides using XInclude?  Do any of the
> Python parsers have any tricks that could help?

Convert all of those sections into Unicode, using UTF-8 as the
encoding form. You could write a trivial Python script to do this for
you.
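
Something along these lines would do it (a minimal sketch; the
filenames, and the availability of the big5 codec, are assumptions):

    # Sketch: re-encode one Big5 section as UTF-8. Filenames are
    # invented for illustration; repeat per <PRE> section as needed.
    def reencode(src_path, dst_path, src_encoding="big5"):
        with open(src_path, encoding=src_encoding) as src:
            text = src.read()              # decode to Unicode
        with open(dst_path, "w", encoding="utf-8") as dst:
            dst.write(text)                # write back out as UTF-8

    reencode("section-zh-tw.big5", "section-zh-tw.utf8")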

The bigger problem (IMHO) will be convincing your DocBook toolchain
to handle the Asian characters. If you find a good solution to that
(i.e., one allowing Simplified and Traditional Chinese, Korean, and,
say, Thai in a single document), let me know.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Zenkaku Language Hacker                            http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"