[I18n-sig] Autoguessing charset for Unicode strings?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 20 Jun 2001 08:57:12 +0200


> It would be possible to construct a table mapping ranges of Unicode
> codepoints (perhaps even character blocks) to certain legacy encodings
> so that the correct one can be chosen quickly. Something like this is
> needed when transcoding from Unicode to ISO-2022-CN.

That would be valuable as a general-purpose service in the Python
library, it seems. I have no experience with such an API, but I think

codecs.find_encodings(ustring)

could work; this would return a list of tuples, each tuple containing
the name of an encoding and the number of initial characters of
ustring that can be represented in this encoding.
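Just to make the idea concrete, here is a rough sketch of what such a
function could do (the name find_encodings and the candidate list are
purely illustrative; nothing like this exists in the codecs module
today):

    def find_encodings(ustring, candidates=('ascii', 'latin-1', 'utf-8')):
        # Return a list of (encoding, n) tuples, where n is the number
        # of initial characters of ustring representable in that encoding.
        results = []
        for name in candidates:
            n = 0
            for ch in ustring:
                try:
                    ch.encode(name)
                except UnicodeError:
                    break
                n += 1
            results.append((name, n))
        return results

    >>> find_encodings(u'Gr\xfc\xdfe')       # u'Grüße'
    [('ascii', 2), ('latin-1', 5), ('utf-8', 5)]

A real implementation would of course consult precomputed tables
rather than do trial encoding, but the return format would be the same.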

An important implementation detail, of course, is how to construct the
necessary data structures in an efficient way. For the codecs that
ship with Python, the tables could be precomputed. For dynamically
registered codecs, the first problem is to come up with a list of all
known codec names - which in itself would be a useful service...
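For the standard distribution, a first cut at such a name list could
be derived from encodings.aliases (a sketch; the helper name is made
up, and this sees only the codecs shipped with Python, not anything
registered at runtime - which is exactly the missing piece):

    from encodings.aliases import aliases

    def shipped_codec_names():
        # Canonical names of the codecs that ship with Python;
        # dynamically registered codecs are invisible here.
        return sorted(set(aliases.values()))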

Regards,
Martin