[Python-ideas] Support WHATWG versions of legacy encodings

Mon Feb 5 05:52:41 EST 2018

On 05.02.2018 04:01, Nick Coghlan wrote:
> On 2 February 2018 at 16:52, Steven D'Aprano <steve at pearwood.info> wrote:
>> If it were my decision, I'd have these codecs raise a warning (not an
>> error) when used for encoding. But I guess some people will consider
>> that either going too far or not far enough :-)
> 
> Rob pointed out that one of the main use cases for these codecs is
> when going "Oh, this was decoded with a WHATWG encoding, which isn't
> right, so I need to re-encode it with that encoding, and then decode
> it with the right encoding". So encoding is very much part of the
> usage model: it's needed when you've received the data over a Unicode
> based interface rather than a binary one.

So the use case for encoding with a WHATWG codec is to undo the
WHATWG mappings, then decode using the standard mappings, with
an error handler to deal with decoding issues?

This strikes me as a rather unrealistic use case, especially since
the original decoding was likely also done in Python, so the
more intuitive way to fix this problem would be to not use
WHATWG encodings for the initial decoding in the first place.
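
For illustration, the round trip being discussed looks roughly like the
sketch below. The stdlib cp1252 and utf-8 codecs are stand-ins here, since
the WHATWG variants are not in the standard library, and redecode() is a
hypothetical helper name, not an existing API.

```python
# Sketch: undo a wrong decode by re-encoding with the same (wrong) codec
# and then decoding with the right one.

def redecode(text, wrong, right, errors="strict"):
    """Hypothetical helper: reverse a decode that was done with `wrong`."""
    return text.encode(wrong).decode(right, errors)

original = "caf\u00e9".encode("utf-8")      # UTF-8 bytes for "café"
garbled = original.decode("cp1252")         # decoded with the wrong codec
fixed = redecode(garbled, "cp1252", "utf-8")

# The stdlib cp1252 codec has no mapping for U+0081, so text that went
# through a WHATWG-style decode of byte 0x81 cannot be re-encoded with it.
try:
    "\x81".encode("cp1252")
    stdlib_rejects_c1 = False
except UnicodeEncodeError:
    stdlib_rejects_c1 = True
```

The last step is where the standard codec falls short for this use case:
the re-encoding only works if the codec used for it has the same mappings
as the one used for the original decoding.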

> So I think the *use case* for the WHATWG encodings has been pretty
> well established. What hasn't been established is whether our answer
> to "How do I handle the WHATWG encodings?" is going to be:
> 
> * "Here they are in the standard library (for 3.8+)!"; or
> * "These are available as part of the 'ftfy' library on PyPI, which
> also helps fix various other problems in decoded text"
> 
> Personally, I think a See Also note pointing to ftfy in the "codecs"
> module documentation would be quite a reasonable outcome of the thread
> - when it comes to consuming arbitrary data from the internet and
> cleaning up decoding issues, ftfy's data introspection based approach
> is likely to be far easier to start with than characterising the
> common errors for specific data sources and applying them
> individually, and if you're already using ftfy to figure out which
> fixes are needed, then it shouldn't be a big deal to keep it around
> for the more relaxed codecs that it provides.

I think we've been going around in circles long enough.

Let's leave things as they are and perhaps add a section to the codecs
documentation, as you suggest, on where to find other encodings which
a user might want to use and tools to help with fixing encoding or
decoding errors.

Here's a random list from PyPI with some packages:
https://pypi.python.org/pypi/ebcdic/
https://pypi.python.org/pypi/latexcodec/
https://pypi.python.org/pypi/mysql-latin1-codec/
https://pypi.python.org/pypi/cbmcodecs/

Perhaps fun variants such as:
https://pypi.python.org/pypi/emoji-encoding/
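
As a rough sketch of how a package of this kind can plug an extra codec
into Python, the example below registers a relaxed cp1252 variant through
codecs.register(). The codec name "web1252sketch" and the helper names are
made up for this example. The table simply fills the bytes that the stdlib
cp1252 codec leaves undefined with the code point of the same value, which
is how the WHATWG index for windows-1252 treats them.

```python
import codecs

def _build_table():
    # Start from the stdlib cp1252 mapping; bytes it leaves undefined
    # fall back to the code point with the same numeric value.
    chars = []
    for b in range(256):
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(b))
    return "".join(chars)

_DECODING_TABLE = _build_table()
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)

def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)

def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)

def _search(name):
    # Hypothetical codec name, chosen only for this sketch.
    if name == "web1252sketch":
        return codecs.CodecInfo(name=name, encode=_encode, decode=_decode)
    return None

codecs.register(_search)

all_bytes = bytes(range(256))
round_trip = all_bytes.decode("web1252sketch").encode("web1252sketch")
```

With such a codec every byte value decodes to something and encodes back
to the same byte, so no error handler is needed for the round trip.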

-- 
Marc-Andre Lemburg
eGenix.com
