[Python-ideas] Support WHATWG versions of legacy encodings

Rob Speer rspeer at luminoso.com
Tue Feb 6 18:26:38 EST 2018


By now, it sounds right to me that I should implement these codecs in a
package. I accept that I've established the use case, but haven't
sufficiently established why the codecs belong in Python itself.

The package can easily be ftfy -- although I should point out that what's
in ftfy at the moment isn't quite right! "ftfy.bad_codecs" implements the
"fall back on Latin-1" idea that many people here have intuitively
suggested, because I was implementing it just based on the evidence of text
I saw; I didn't know at the time that there was an actual standard
involved. The result differs subtly from what Web browsers do in cases
outside the C1 range. But of course I can work on re-implementing the
encodings correctly based on what I've learned.
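
A minimal sketch of the "fall back on Latin-1" idea, using only the standard
library. The helper name is hypothetical and this is not ftfy's actual code.
For windows-1252 the five bytes that Python's "cp1252" codec leaves undefined
all sit in the C1 range, so here the fallback yields the same-numbered C1
control characters; code pages with gaps outside that range are where the
subtle differences mentioned above would show up.

```python
# Bytes that Python's built-in "cp1252" codec treats as undefined.
UNDEFINED_IN_PYTHON_CP1252 = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])


def decode_1252_with_latin1_fallback(data: bytes) -> str:
    """Hypothetical helper (not part of ftfy or the stdlib)."""
    chars = []
    for value in data:
        byte = bytes([value])
        try:
            chars.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in cp1252: fall back to the Latin-1 meaning,
            # i.e. the C1 control character with the same number.
            chars.append(byte.decode("latin-1"))
    return "".join(chars)


# The strict built-in codec refuses these bytes outright.
try:
    UNDEFINED_IN_PYTHON_CP1252.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Curly quotes (0x93, 0x94) decode normally; 0x81 falls back to U+0081.
decoded = decode_1252_with_latin1_fallback(b"\x93quoted\x94 \x81")
```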

I think it would be best if these encodings were actually implemented in
the "webencodings" package, or in a package that both ftfy and webencodings
could use. I have certainly encountered cases in web scraping where,
because webencodings doesn't use the same Windows-1252 as the actual web
does, I have had to decode the text even more incorrectly using Latin-1 and
_then_ run it through ftfy -- in effect, adding a layer of mojibake so I
can fix two layers of mojibake. That's kind of absurd and it's why I
thought this belonged in Python itself. But I'll talk to the webencodings
author instead.
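
To illustrate why the Latin-1 detour works at all, here is a rough sketch
with made-up input bytes. Latin-1 maps every byte to a code point, so nothing
is lost and a later repair step can still recover the text. The manual
re-encode below only stands in for the repair that ftfy would do; it is an
assumption for illustration, not ftfy's algorithm.

```python
# Hypothetical input: UTF-8 bytes for U+00C1 that arrive labelled as
# windows-1252. The second byte, 0x81, is undefined in Python's cp1252.
raw = "\u00c1".encode("utf-8")  # b"\xc3\x81"

# Decoding with Python's cp1252 and errors="replace" destroys the 0x81 byte.
lossy = raw.decode("cp1252", errors="replace")

# Decoding as Latin-1 is "more wrong" but lossless: each byte becomes the
# code point with the same number.
detour = raw.decode("latin-1")

# Stand-in for the repair step: undo the Latin-1 layer, then decode the
# bytes as what they really were.
recovered = detour.encode("latin-1").decode("utf-8")
```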

On Tue, 6 Feb 2018 at 05:12 Stephen J. Turnbull <
turnbull.stephen.fw at u.tsukuba.ac.jp> wrote:

> Nick Coghlan writes:
>
>  > Personally, I think a See Also note pointing to ftfy in the "codecs"
>  > module documentation would be quite a reasonable outcome of the thread
>
> Yes please.  The more I hear about purported use cases (with the
> exception of Nathaniel's "don't crash when I manipulate the DOM" case,
> which is best handled by errors='surrogateescape'), the less I see
> anything "standard" about them.