[Python-Dev] Encoding detection in the standard library?

"Martin v. Löwis" martin at v.loewis.de
Wed Apr 23 07:04:54 CEST 2008


> I certainly agree that if the target set of documents is small enough it
> is possible to hand-code the encoding.  There are many applications,
> however, that need to examine the content of an arbitrary, or at least
> non-small set of web documents.  To name a few such applications:
> 
>  - web search engines
>  - translation software

I question whether these really are "many" programs. Web search engines
and translation software have many more challenges to master, and they
are fairly specialized, so I would expect they need to find their own
answer to character set detection anyway (see also Bill Janssen's
answer on machine translation).

>  - document/bookmark management systems
>  - other kinds of document analysis (market research, seo, etc.)

I'm not sure what specifically you have in mind; however, I expect that
these also come with challenges of their own. For example, I would
expect MS-Word documents to be frequent in such collections. You don't
need character set detection there (Word documents are all Unicode
internally), but you do need an API to look into the structure of .doc
files.
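
For illustration, peeking into a .doc container can be done today with
the third-party olefile package; a minimal sketch (the file name is a
placeholder, and this only lists the container's streams, not the text):

import olefile

# .doc files are OLE2 compound documents; olefile opens the container
# and enumerates its streams without understanding Word's own format.
if olefile.isOleFile("report.doc"):       # placeholder file name
    ole = olefile.OleFileIO("report.doc")
    print(ole.listdir())                  # e.g. [['WordDocument'], ['1Table'], ...]
    print(ole.exists("WordDocument"))     # stream holding the document text
    ole.close()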

> Not that I can substantiate.  Documents & feeds cover a lot of what is
> on the web--I was only trying to make the point that on the web,
> whenever an encoding can be specified, it will be specified incorrectly
> for a significant chunk of exemplars.

I firmly believe this assumption is false. If the encoding declaration
comes out of software (which it usually does), it will be correct most
of the time; it is typically incorrect only when a content editor has
typed it in by hand. A CMS that stores its pages as UTF-8, for example,
will normally emit a matching declaration automatically.
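
To make that concrete, "trust the declaration first" might look like
the sketch below; the header parsing and the UTF-8 fallback are
illustrative assumptions, not a proposal for a concrete API:

def charset_from_content_type(content_type):
    # Pull the charset parameter out of a Content-Type header value,
    # e.g. "text/html; charset=ISO-8859-2" -> "iso-8859-2".
    for param in content_type.split(";")[1:]:
        key, _, value = param.strip().partition("=")
        if key.lower() == "charset":
            return value.strip('"').lower()
    return None

def decode_document(raw_bytes, content_type):
    declared = charset_from_content_type(content_type)
    if declared:
        try:
            return raw_bytes.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # declaration was wrong or unknown; fall through
    # Only at this point would guessing (chardet or otherwise) come in;
    # UTF-8 with replacement characters is just a placeholder fallback.
    return raw_bytes.decode("utf-8", errors="replace")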

> Indeed, if it is only one site it is pretty easy to work around.  My
> main use of Python is processing and analyzing hundreds of millions of
> web documents, so it is pretty easy to see applications (which I have
> listed above).

Ok. What advantage would you (or somebody working on a similar project)
gain if chardet were part of the standard library? What if it were not
chardet, but some other algorithm?
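
For context, using chardet as a third-party package looks roughly like
this; the confidence threshold and the latin-1 fallback are arbitrary
choices for illustration:

import chardet

with open("page.html", "rb") as f:       # placeholder file name
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.8}
if guess["encoding"] and guess["confidence"] > 0.5:
    text = raw.decode(guess["encoding"], errors="replace")
else:
    text = raw.decode("latin-1")         # never fails; assumed fallback

Inclusion in the stdlib would mostly change the import line; the
question of detection quality stays exactly the same.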

> I can also think of good arguments for excluding encoding detection for
> maintenance reasons: is every case of the algorithm guessing wrong a bug
> that needs to be fixed in the stdlib?  That is an unbounded commitment.

Indeed, that's what I meant by my initial remark. People will expect
that it works correctly, with two consequences: they will unknowingly
proceed with the incorrect response, and they will then complain when
they find out that it produced an incorrect answer.

For chardet specifically, my usual standard-library remark applies:
it can't become part of the standard library unless the original
author contributes it, anyway. I would then hope that he or a group
of people would volunteer to maintain it, with the threat of removing
it from the stdlib again if these volunteers go away and too many
problems show up.

Regards,
Martin


