[Python-ideas] Support WHATWG versions of legacy encodings
Guido van Rossum
guido at python.org
Fri Jan 19 11:20:26 EST 2018
On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg <mal at egenix.com> wrote:
> On 19.01.2018 05:38, Nathaniel Smith wrote:
> > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <guido at python.org> wrote:
> >> Can someone explain to me why this is such a controversial issue?
> >
> > I guess practicality versus purity is always controversial :-)
> >
> >> It seems reasonable to me to add new encodings to the stdlib that do the
> >> roundtripping requested in the first message of the thread. As long as they
> >> have new names that seems to fall under "practicality beats purity".
>
> There are a few issues here:
>
> * WHATWG encodings are mostly for decoding content in order to
> show it in the browser, accepting broken encoding data.
>
And sometimes Python apps that pull data from the web.
> Python already has support for this by using one of the available
> error handlers, or adding new ones to suit the needs.
>
This seems cumbersome though.
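
To make the comparison concrete, here is a minimal sketch of what the error-handler route looks like for one stray byte. The handler name "c1-passthrough" and the function are hypothetical, made up for this illustration; only codecs.register_error and the built-in handler names come from the stdlib.

```python
import codecs

raw = b"\x81"  # a byte that Python's cp1252 table leaves undefined

# What the built-in handlers make of it when decoding:
replaced = raw.decode("cp1252", errors="replace")            # U+FFFD
ignored = raw.decode("cp1252", errors="ignore")              # byte dropped
escaped = raw.decode("cp1252", errors="backslashreplace")    # literal \x81
smuggled = raw.decode("cp1252", errors="surrogateescape")    # lone surrogate


def c1_passthrough(exc):
    """Hypothetical decode-only handler: map each undecodable byte to the
    code point with the same value (0x81 -> U+0081)."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return bad.decode("latin-1"), exc.end
    raise exc


# The name is an assumption for this sketch, not an existing handler.
codecs.register_error("c1-passthrough", c1_passthrough)

text = b"caf\xe9 \x81".decode("cp1252", errors="c1-passthrough")
```

Every call site then has to remember to pass errors="c1-passthrough", and the name is registered globally, so nothing stops it from being handed to an unrelated codec such as UTF-8, where the result would be silently wrong.
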
> If we add the encodings, people will start creating more
> broken data, since this is what the WHATWG codecs output
> when encoding Unicode.
>
That's FUD. Only apps that specifically use the new WHATWG encodings would
be able to consume that data. And surely the practice of web browsers will
have a much bigger effect than Python's choice.
> As discussed, this could be addressed by making the WHATWG
> codecs decode-only.
>
But that would defeat the point of roundtripping, right?
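
As a rough sketch of what roundtripping means here: every one of the 256 byte values should decode to some character and encode back to the same byte. The handler below is hypothetical (its name and logic are assumptions for illustration, not an existing API); it covers both directions but only for the slots that Python's cp1252 table marks as undefined.

```python
import codecs
from encodings import cp1252

# Byte values that the stdlib cp1252 table marks as undefined
# (the table uses U+FFFE as the placeholder for those slots).
UNDEFINED = {i for i, ch in enumerate(cp1252.decoding_table) if ch == "\ufffe"}


def c1_roundtrip(exc):
    """Hypothetical handler: undefined byte <-> code point of the same value."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(b in UNDEFINED for b in bad):
            return bad.decode("latin-1"), exc.end
    elif isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        if all(ord(ch) in UNDEFINED for ch in bad):
            # Encode handlers may return bytes, copied to the output as-is.
            return bad.encode("latin-1"), exc.end
    raise exc


# The handler name is made up for this sketch.
codecs.register_error("c1-roundtrip", c1_roundtrip)

every_byte = bytes(range(256))
text = every_byte.decode("cp1252", errors="c1-roundtrip")
back = text.encode("cp1252", errors="c1-roundtrip")
roundtrips = back == every_byte
```

A decode-only codec would give you the first half of this and then fail (or need yet another workaround) on the way back out.
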
> * The use case seems limited to implementing browsers or headless
> implementations working like browsers.
>
> That's not really general enough to warrant adding lots of
> new codecs to the stdlib. A PyPI package is better suited
> for this.
>
Perhaps, but such a package already exists and its author (who surely has
read a lot of bug reports from its users) says that this is cumbersome.
> * The WHATWG codecs do not only cover simple mapping codecs,
> but also many multi-byte ones for e.g. Asian languages.
>
> I doubt that we'd want to maintain such codecs in the stdlib,
> since this will increase the download sizes of the installers
> and also require people knowledgeable about these variants
> to work on them and fix any issues.
>
Really? Why is adding a bunch of codecs so much effort? Surely the
translation tables contain data that compresses well? And surely we don't
need a separate dedicated piece of C code for each new codec?
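
For the single-byte case at least, a sketch of why this looks cheap to me: a charmap codec is a 256-entry table fed to the generic codecs.charmap_decode / codecs.charmap_encode routines. The codec name "web_1252_sketch" and the way the table is derived (filling the undefined cp1252 slots with the code point equal to the byte value) are assumptions for illustration; a real implementation would take its tables from the WHATWG index files, and the multi-byte codecs are a different matter.

```python
import codecs
from encodings import cp1252

# Start from the stdlib table and fill each undefined slot (marked with
# U+FFFE) with the code point equal to the byte value.
decoding_table = "".join(
    chr(i) if ch == "\ufffe" else ch
    for i, ch in enumerate(cp1252.decoding_table)
)
encoding_table = codecs.charmap_build(decoding_table)


def decode(data, errors="strict"):
    # Generic table-driven decoder shared by all charmap codecs.
    return codecs.charmap_decode(data, errors, decoding_table)


def encode(text, errors="strict"):
    # Generic table-driven encoder shared by all charmap codecs.
    return codecs.charmap_encode(text, errors, encoding_table)


def search(name):
    # "web_1252_sketch" is a hypothetical name chosen for this example.
    if name == "web_1252_sketch":
        return codecs.CodecInfo(encode, decode, name=name)
    return None


codecs.register(search)

decoded = b"\x80\x81".decode("web_1252_sketch")
encoded = decoded.encode("web_1252_sketch")
```

Nothing codec-specific is compiled here; the only new data is one 256-character string.
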
> Overall, I think either pointing people to error handlers
> or perhaps adding a new one specifically for the case of
> dealing with control character mappings would provide a better
> maintenance / usefulness ratio than adding lots of new
> legacy codecs to the stdlib.
>
Wouldn't error handlers be much slower? And to me it seems a new error
handler is a much *bigger* deal than some new encodings -- error handlers
must work for *all* encodings.
> BTW: WHATWG pushes for always using UTF-8 as far as I can tell
> from their website.
>
As does Python. But apparently it will take decades more to get there.
> >> (Modifying existing encodings seems wrong -- did the feature request somehow
> >> transmogrify into that?)
> >
> > Someone did discover that Microsoft's current implementations of the
> > windows-* encodings match the WHAT-WG spec, rather than the Unicode
> > spec that Microsoft originally wrote.
>
> No, MS implements something called "best fit encodings"
> and these are different than what WHATWG uses.
>
> Unlike the WHATWG encodings, these are documented as vendor encodings
> on the Unicode site, which is what we normally use as reference
> for our stdlib codecs.
>
> However, whether these are actually a good idea is open to discussion
> as well, since they sometimes go a bit far with "best fit", e.g.
> mapping the infinity symbol to 8.
>
> Again, using the error handlers we have for dealing with
> situations which require non-standard encoding behavior is
> the better approach:
>
> https://docs.python.org/3.7/library/codecs.html#error-handlers
>
> Adding new ones is possible as well.
>
> > So there is some argument that
> > Python's existing encodings are simply out of date, and changing
> > them would be a bugfix. (And standards aside, it is surely going to be
> > somewhat error-prone if Python's windows-1252 doesn't match everyone
> > else's implementations of windows-1252.) But yeah, AFAICT the original
> > requesters would be happy either way; they just want it available
> > under some name.
>
> The encodings are not out of date. I don't know where you got
> that impression from.
>
> The Windows API WideCharToMultiByte which was quoted in the discussion:
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>
> unfortunately uses the above mentioned best fit encodings,
> but this can and should be switched off by specifying the
> WC_NO_BEST_FIT_CHARS flag for anything that requires validation
> or needs to be interoperable:
>
> """
> For strings that require validation, such as file, resource, and user
> names, the application should always use the WC_NO_BEST_FIT_CHARS flag.
> This flag prevents the function from mapping characters to characters
> that appear similar but have very different semantics. In some cases,
> the semantic change can be extreme. For example, the symbol for "∞"
> (infinity) maps to 8 (eight) in some code pages.
> """
>
> --
> Marc-Andre Lemburg
> eGenix.com
--
--Guido van Rossum (python.org/~guido)