[I18n-sig] CJK codecs etc

M.-A. Lemburg mal@lemburg.com
Fri, 17 Mar 2000 09:40:49 +0100


Christian Wittern wrote:
> 
> Marc-Andre Lemburg wrote:
> 
> > Christian Wittern wrote:
> > >
> > >
> > > 1.) Please provide a (configurable?) fallback for failed conversions.
> > > This is of course especially needed for conversions out of Unicode.
> > > What I have in mind is, for example, to provide the Unicode codepoint
> > > as an entity (&U-4e00;) or Java escape or some such, depending on the
> > > user's choice. Don't just give a '?', which is what M$'s braindead
> > > conversion routines do and thus regularly drive me nuts.
> >
> > Please read the Misc/unicode.txt file. There are different error
> > handling techniques available... 'strict' (raise an error),
> > 'ignore' (ignore the failed mapping), 'replace' (replace the
> > failed mapping by some codec specific replacement char, e.g. '?').
> 
> Err. If you read my comment above, this is exactly what I *don't* want to
> see, since this is of no help at all. What I want to have is a fallback
> mechanism, that preserves the information contained in the file (or maps it
> to some other second best match). Simply raising an error or putting in some
> default char is not helpful to the user at all!!!

Codecs may provide more than these three error handling
modes -- the only requirement is that at least these three
are defined.

Note that 'replace' and 'ignore' do have their value when
it comes to writing code that puts more priority on working
without errors than on 100% correct output.
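
As a rough illustration of the three required modes next to a codec-level
fallback that keeps the information, here is a minimal sketch. It assumes
present-day Python 3 and its codecs.register_error() hook, which is not part
of the design discussed in this thread; the handler name "uentity" and the
&U-xxxx; output format are hypothetical, taken from the example entity in
Christian's point 1.

```python
import codecs

def entity_fallback(exc):
    """Replace each unencodable character with an &U-xxxx; entity."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&U-%04x;" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end

# "uentity" is a made-up handler name used only for this sketch.
codecs.register_error("uentity", entity_fallback)

text = "A\u4e00B"

# 'strict' raises an error on the failed mapping.
strict_failed = False
try:
    text.encode("ascii", "strict")
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("ascii", "ignore")     # drops the character
replaced = text.encode("ascii", "replace")   # substitutes '?'
preserved = text.encode("ascii", "uentity")  # keeps the codepoint as an entity
```

The custom handler trades readability of the output for losslessness, which
is the behaviour asked for above; 'replace' and 'ignore' stay available for
code that only needs to keep running.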

> > The error argument is codec specific -- the above values must
> > work though.
> >
> > > 2.) On the same topic, there are some fairly frequent codepoints that
> > > map to different codepoints in the Japanese and Taiwanese encodings,
> > > although this is in most cases not expected. These codepoints should
> > > have been eliminated by Unicode's unification rules, but crept in via
> > > the source-encoding separation rule -- not a very good decision in my
> > > opinion. I have a list of some such characters at
> > > http://www.chibs.edu.tw/~chris/smart/cjkconv.htm. Ideally, there
> > > should be a way for the user to influence the conversion by providing
> > > a list of his choice (with his modifications) to the codec, to overlay
> > > the predefined values.
> >
> > Everybody can write their own codecs... so no comment on this one ;-)
> >
> > > 3.) The nasty problem of user defined characters. I think there should
> > > be a default mapping of the user defined area in DBCS encodings to the
> > > Unicode code range for user characters. Microsoft uses fixed sequential
> > > tables and I think that is a good idea, since it is pretty
> > > straightforward. In Big5, for example, the area of user defined
> > > characters starts at FA40, FA41, ..., which gets mapped to Unicode
> > > E000, E001, ... There should also be an option to use some kind of
> > > entity reference instead.
> >
> > The core Python Unicode implementation doesn't touch these
> > private code areas at all. This issue is left to the codecs.
> >
> > Since they are probably of some importance to the Asian world
> > due to the many corporate char sets, I guess the Asian codecs
> > should provide some kind of logic to handle these areas as
> > special cases... perhaps by passing an extra mapping table
> > to the codec.
> 
> That would solve the above point 2 as well and is all I have in mind here:
> Leave some hook so that the user can pass an overlaid extra mapping table,
> without having to write a codec of his own. Although I realize the latter is
> possible, I don't think it is practical and maybe not even desirable. I
> don't want to design a different car from scratch, just because I don't like
> the color:-)

I think we are starting to pile up some good comments on
what the Asian codecs should look like... perhaps it's time
for someone to jump in and write a proposal as a basis for further
discussion.

(I don't have time for this and not even enough knowledge about
the complexity of the Asian encodings, so I'll leave this to
one of you...)
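
To give such a proposal something concrete to start from, the "extra mapping
table" hook from points 2 and 3 might look roughly like the sketch below. It
is only an assumption about one possible shape: decode_with_overlay() and
user_area are hypothetical names, the base codec is Python 3's standard
"big5" codec, and the two-byte scanning is a simplification of real Big5
handling.

```python
def decode_with_overlay(data, overlay, base="big5"):
    """Decode DBCS bytes, consulting a user-supplied overlay table first.

    overlay maps two-byte codes (as integers) to replacement strings;
    anything not listed falls through to the base codec.
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Single-byte ASCII range.
            out.append(chr(data[i]))
            i += 1
            continue
        pair = data[i:i + 2]
        code = int.from_bytes(pair, "big")
        if code in overlay:
            out.append(overlay[code])
        else:
            out.append(pair.decode(base))
        i += 2
    return "".join(out)

# Sequential mapping of the user-defined area as described in point 3:
# FA40 -> U+E000, FA41 -> U+E001, ... (trail bytes 0x40..0x7E only).
user_area = {0xFA40 + n: chr(0xE000 + n) for n in range(0x3F)}

decoded = decode_with_overlay(b"A\xfa\x40\xfa\x41", user_area)
```

The same table could also override individual predefined mappings, which
covers the overlay list asked for in point 2 without writing a whole codec.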

> >
> > > 4.) Years ago I developed the habit of using entity references for any
> > > characters not representable in the given character set used by the
> > > system. I have seen this becoming more widespread in the user
> > > communities I work with. It would be very useful for us if the Unicode
> > > conversion routines in Python could be told to treat some arbitrary
> > > entity references (we use things like &M24501; for the characters
> > > assigned by the Mojikyo Font Institute (see www.mojikyo.gr.jp) and
> > > &C4-4e21; for characters in the Taiwanese CNS encoding). I realize
> > > that this is a rather specialised usage, but it would be great and
> > > very helpful to have some hook in the system to treat this stuff just
> > > like any other character.
> >
> > Hmm, sounds like some kind of SGML entity codec could solve this
> > aspect...
> 
> Right, but how would that be integrated with the other codecs?

Codecs are stackable :-)
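
One way to picture the stacking, as a minimal sketch in present-day Python 3:
decode the bytes with the base codec first, then run an entity-resolving
layer over the result. The names resolve_entities() and decode_stacked() and
the private-use assignments in the table are hypothetical, chosen only for
illustration; a real entity codec would need an agreed mapping.

```python
import re

# Matches references such as &U-4e00;, &M24501; or &C4-4e21;.
ENTITY = re.compile(r"&([A-Za-z0-9-]+);")

def resolve_entities(text, table):
    """Second layer: turn entity references into single characters."""
    def substitute(match):
        name = match.group(1)
        if name.startswith("U-"):
            try:
                return chr(int(name[2:], 16))
            except (ValueError, OverflowError):
                return match.group(0)
        # Unknown references are left untouched.
        return table.get(name, match.group(0))
    return ENTITY.sub(substitute, text)

def decode_stacked(data, encoding, table):
    """First layer: base codec; second layer: entity resolution."""
    return resolve_entities(data.decode(encoding), table)

# Hypothetical private-use assignments for two non-Unicode characters.
table = {"M24501": "\U000F0000", "C4-4e21": "\U000F0001"}

result = decode_stacked(b"x&U-4e00;y&M24501;z&unknown;", "ascii", table)
```

Because the entity layer only sees already-decoded text, it can sit on top
of any base codec unchanged.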

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/