[Python-Dev] Automatic encoding detection [was: Re: Python3 "complexity" - 2 use cases]

Jim J. Jewett jimjjewett at gmail.com
Tue Jan 14 00:48:36 CET 2014



>> So when it is time to guess [at the character encoding of a file],
>> a source of good guesses is an important battery to include.

> The barrier for entry to the standard library is higher than mere
> usefulness.

Agreed.  But "most programs will need it, and people will either
include (the same) 3rd-party library themselves, or write their
own workaround, or have buggy code" *is* sufficient.

The points of contention are

    (1)  How many programs have to deal with documents written
         outside their control -- and probably originating on
         another system?

I'm not ready to say "most" programs in general, but I think that
barrier is met for both web clients (for which we already supply
several batteries) and quick-and-dirty utilities.

    (2)  How serious are the bugs / How annoying are the workarounds?

As someone who mostly sticks to English, and who tends to manually
ignore stray bytes when dealing with a semi-binary file format,
the bugs aren't that serious for me personally.  So I may well
choose to write buggy programs, and the bug may well never get
triggered on my own machine.

But having a batch process crash one run in ten (where it didn't
crash at all under Python 2) is a bad thing.  There are environments
where (once I knew about it) I would add chardet (if I could get
approval for the 3rd-party component).

    (3)  How clearcut is the *right* answer?

As I said, at one point (several years ago), the W3C and WHATWG
started to standardize the "right" answer.  They backed that out,
because vendors wanted the option to improve their detection in
the future without violating standards.

There are certainly situations where local knowledge can do
better than a global solution like chardet, but ... the
"right" answer is clear most of the time.

Just ignoring the problem is still a 99% answer, because most text
in ASCII-mostly environments really is "close enough".  But that
is harder (and the One Obvious Way is less reliable) under Python 3
than it was under Python 2.

An alias for "open" that defaulted to the "surrogateescape" error
handler (or returned the new "ASCIIstr" bytes hybrid) would probably
be sufficient to get back (almost) to Python 2 levels of ease and
reliability.  But it would tend to encourage ASCII/English-only
assumptions.
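
For concreteness, here is a minimal sketch of what such an alias might
look like.  The name "open_lenient" is made up for illustration and is
not a stdlib API; the choice of ASCII as the base encoding and of
newline="" (to keep line endings byte-for-byte) are assumptions too.

```python
import functools
import os
import tempfile

# Hypothetical alias (the name is an assumption, not a stdlib API):
# a text-mode open() that never raises on undecodable bytes.
# Each non-ASCII byte 0xNN is read as the lone surrogate U+DCNN and
# is written back out as the original byte.
open_lenient = functools.partial(
    open, encoding="ascii", errors="surrogateescape", newline=""
)


def roundtrip(raw):
    """Write raw bytes, read them as text, write the text back.

    Returns (text_as_read, bytes_as_rewritten).
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "wb") as f:
            f.write(raw)
        with open_lenient(src) as f:
            text = f.read()
        with open_lenient(dst, "w") as f:
            f.write(text)
        with open(dst, "rb") as f:
            return text, f.read()


# A Latin-1 encoded file: 0xE9 is not valid ASCII (or valid UTF-8 here).
text, copied = roundtrip(b"caf\xe9 au lait\n")
```

The stray byte survives the trip unchanged, which is roughly the
Python 2 experience.  The catch is that "text" now holds a lone
surrogate, so encoding it strictly (say, to UTF-8 for printing) will
still fail later.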

You could fix most of the remaining problems by scripting a web
browser, except that scripting the browser in a cross-platform
manner is slow and problematic, even with webbrowser.py.

"Whatever a recent Firefox does" is (almost by definition) good
enough, and is available ... but maybe not in a convenient form,
which is one reason that chardet was created as a port thereof.
Also note that Firefox assumes you will update more often than
Python does.

"Whatever chardet said at the time the Python release was cut"
is almost certainly good enough too.
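
As a rough sketch of how a quick-and-dirty utility might lean on that
today: the helper below asks chardet for a guess when the third-party
package happens to be installed.  The name "guess_decode" and the
UTF-8-then-Latin-1 fallback order are my assumptions for illustration,
not anything chardet or the stdlib prescribes.

```python
def guess_decode(data):
    """Decode bytes to str using a best-effort guess at the encoding.

    Hypothetical helper: uses chardet's guess if the package is
    installed, then falls back to UTF-8, then to Latin-1 (which
    accepts every byte, so the final step cannot fail).
    """
    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        chardet = None

    encoding = None
    if chardet is not None:
        # detect() returns a dict; "encoding" may be None for
        # empty or undecidable input.
        encoding = chardet.detect(data).get("encoding")

    if encoding:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            pass  # bad guess or unknown codec name; use the fallbacks

    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


sample = guess_decode(b"plain ascii text")
```

On short or ambiguous input the guess can be wrong, so the result may
be mojibake rather than an exception.  That is the trade a batch job
makes when it must not crash.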

The browser makers go to great lengths to match each other even 
in bizarre corner cases.  (Which is one reason there aren't more
competing solutions.)  But that doesn't mean it is *impossible*
to construct a test case where they disagree -- or even one where
a recent improvement in the algorithms led to regressions for one
particular document.

That said, such regressions should be limited to documents that
were not properly labeled in the first place, and should be rare
even there.  Think of the changes as obscure bugfixes, akin to
a program starting to handle NaN properly, in a place where it
"should" not ever see one.

-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ


