[Python-ideas] Support WHATWG versions of legacy encodings

Guido van Rossum guido at python.org
Fri Jan 19 11:20:26 EST 2018


On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <mal at egenix.com> wrote:

> On 19.01.2018 05:38, Nathaniel Smith wrote:
> > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <guido at python.org> wrote:
> >> Can someone explain to me why this is such a controversial issue?
> >
> > I guess practicality versus purity is always controversial :-)
> >
> >> It seems reasonable to me to add new encodings to the stdlib that do the
> >> roundtripping requested in the first message of the thread. As long as
> >> they have new names that seems to fall under "practicality beats purity".
>
> There are a few issues here:
>
> * WHATWG encodings are mostly for decoding content in order to
>   show it in the browser, accepting broken encoding data.
>

And sometimes Python apps that pull data from the web.


>   Python already has support for this by using one of the available
>   error handlers, or adding new ones to suit the needs.
>

This seems cumbersome though.
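
For illustration, here is a minimal sketch of what the error-handler route
looks like for decoding. The handler name "c1-passthrough" and the function
name are made up for this example; only codecs.register_error() and the
UnicodeDecodeError attributes come from the stdlib. It assumes the goal is to
map bytes that cp1252 leaves undefined to the code points with the same value.

```python
import codecs

def c1_passthrough(exc):
    """Decode-only handler: map each undecodable byte to the code point
    with the same numeric value (e.g. 0x81 -> U+0081)."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch does not attempt the encoding direction.
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end

# "c1-passthrough" is a hypothetical name, not an existing stdlib handler.
codecs.register_error("c1-passthrough", c1_passthrough)

raw = b"caf\xe9 \x81 \x80"
text = raw.decode("cp1252", errors="c1-passthrough")
```

Every caller has to register the handler and remember to pass errors=...
at each decode call, and this sketch still does nothing for encoding.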


>   If we added the encodings, people would start creating more
>   broken data, since this is what the WHATWG codecs output
>   when encoding Unicode.
>

That's FUD. Only apps that specifically use the new WHATWG encodings would
be able to consume that data. And surely the practice of web browsers will
have a much bigger effect than Python's choice.


>   As discussed, this could be addressed by making the WHATWG
>   codecs decode-only.
>

But that would defeat the point of roundtripping, right?
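
To make the roundtripping point concrete, here is a rough sketch of a
charmap-style codec that fills the holes in the stdlib cp1252 table with the
same-valued code points, which is my understanding of what the WHATWG
windows-1252 index does. The helper names (DECODING_TABLE, decode, encode)
are invented for the example; codecs.charmap_build(), codecs.charmap_decode()
and codecs.charmap_encode() are the same helpers the generated stdlib charmap
codecs use.

```python
import codecs

def _build_decoding_table():
    """Start from the stdlib cp1252 and fill undefined bytes with chr(byte)."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in the stdlib codec: pass through as a C1 control.
            chars.append(chr(byte))
    return "".join(chars)

DECODING_TABLE = _build_decoding_table()
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)

def decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)[0]

def encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)[0]

# Every possible byte string survives a decode/encode round trip.
every_byte = bytes(range(256))
roundtripped = encode(decode(every_byte))
```

A decode-only variant would drop encode(), and with it the guarantee that
arbitrary input bytes can be written back out unchanged.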


> * The use case seems limited to implementing browsers or headless
>   implementations working like browsers.
>
>   That's not really general enough to warrant adding lots of
>   new codecs to the stdlib. A PyPI package is better suited
>   for this.
>

Perhaps, but such a package already exists and its author (who surely has
read a lot of bug reports from its users) says that this is cumbersome.


> * The WHATWG codecs do not only cover simple mapping codecs,
>   but also many multi-byte ones for e.g. Asian languages.
>
>   I doubt that we'd want to maintain such codecs in the stdlib,
>   since this will increase the download sizes of the installers
>   and also require people knowledgeable about these variants
>   to work on them and fix any issues.
>

Really? Why is adding a bunch of codecs so much effort? Surely the
translation tables contain data that compresses well? And surely we don't
need a separate dedicated piece of C code for each new codec?


> Overall, I think either pointing people to error handlers
> or perhaps adding a new one specifically for the case of
> dealing with control character mappings would provide a better
> maintenance / usefulness ratio than adding lots of new
> legacy codecs to the stdlib.
>

Wouldn't error handlers be much slower? And to me it seems a new error
handler is a much *bigger* deal than some new encodings -- error handlers
must work for *all* encodings.
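
A small sketch of why that is: a registered handler is looked up by name
only, so any codec can end up calling it. The handler name "record-codec"
and the list `seen` are hypothetical names for this example.

```python
import codecs

seen = []  # codec names reported to the handler, in call order

def recording_handler(exc):
    """Record which codec invoked us, then substitute U+FFFD."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    seen.append(exc.encoding)
    return "\ufffd", exc.end

# "record-codec" is a made-up name used only for this illustration.
codecs.register_error("record-codec", recording_handler)

# The same handler is reachable from unrelated codecs.
b"\x81".decode("cp1252", errors="record-codec")
b"\xff".decode("ascii", errors="record-codec")
b"\xff".decode("utf-8", errors="record-codec")
```

Nothing ties the handler to one encoding, so it has to do something sensible
for all of them. As far as I can tell, the charmap-based codecs report
themselves to the handler as plain "charmap", so it cannot even reliably tell
which single-byte encoding it was called for.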


> BTW: WHATWG pushes for always using UTF-8 as far as I can tell
> from their website.
>

As does Python. But apparently it will take decades more to get there.


> >> (Modifying existing encodings seems wrong -- did the feature request
> >> somehow transmogrify into that?)
> >
> > Someone did discover that Microsoft's current implementations of the
> > windows-* encodings match the WHATWG spec, rather than the Unicode
> > spec that Microsoft originally wrote.
>
> No, MS implements something called "best fit encodings"
> and these are different from what WHATWG uses.
>
> Unlike the WHATWG encodings, these are documented as vendor encodings
> on the Unicode site, which is what we normally use as reference
> for our stdlib codecs.
>
> However, whether these are actually a good idea is open to discussion
> as well, since they sometimes go a bit far with "best fit", e.g.
> mapping the infinity symbol to 8.
>
> Again, using the error handlers we have for dealing with
> situations which require non-standard encoding behavior is
> the better approach:
>
> https://docs.python.org/3.7/library/codecs.html#error-handlers
>
> Adding new ones is possible as well.
>
> > So there is some argument that
> > Python's existing encodings are simply out of date, and changing
> > them would be a bugfix. (And standards aside, it is surely going to be
> > somewhat error-prone if Python's windows-1252 doesn't match everyone
> > else's implementations of windows-1252.) But yeah, AFAICT the original
> > requesters would be happy either way; they just want it available
> > under some name.
>
> The encodings are not out of date. I don't know where you got
> that impression from.
>
> The Windows API WideCharToMultiByte, which was quoted in the discussion:
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>
> unfortunately uses the above-mentioned best fit encodings,
> but this can and should be switched off by specifying the
> WC_NO_BEST_FIT_CHARS flag for anything that requires validation
> or needs to be interoperable:
>
> """
> For strings that require validation, such as file, resource, and user
> names, the application should always use the WC_NO_BEST_FIT_CHARS flag.
> This flag prevents the function from mapping characters to characters
> that appear similar but have very different semantics. In some cases,
> the semantic change can be extreme. For example, the symbol for "∞"
> (infinity) maps to 8 (eight) in some code pages.
> """
>
> --
> Marc-Andre Lemburg
> eGenix.com
>


-- 
--Guido van Rossum (python.org/~guido)