[Python-ideas] Support WHATWG versions of legacy encodings

Rob Speer rspeer at luminoso.com
Wed Jan 10 14:13:39 EST 2018


I was originally proposing these encodings under different names, and
that's what I think they should have. Indeed, that helps because a
pip-installable library can backport the new encodings to previous versions of
Python.

Having a pip-installable library as the _only_ way to use these encodings
is the status quo that I am very familiar with. It's awkward. To use a
package that registers new codecs, you have to import something from that
package, even if you never call anything from what you imported, and that
makes flake8 complain. The idea that an encoding name may or may not be
registered, based on what has been imported, breaks our intuition about
reading Python code and is very hard to statically analyze.
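
To make the import side effect concrete, here is a minimal sketch of how a
package can register a codec when it is imported. The codec name
"demo-latin1-alias" and the helper names are hypothetical, chosen only for
this illustration; the codec simply delegates to latin-1 so that the
registration mechanism is the only thing on show.

```python
import codecs

# Hypothetical codec name, used only for this illustration.
CODEC_NAME = "demo-latin1-alias"


def _search(name):
    # Python passes a normalized (lowercased) name; newer versions also turn
    # hyphens into underscores, so compare with hyphens normalized.
    if name.replace("-", "_") != CODEC_NAME.replace("-", "_"):
        return None
    base = codecs.lookup("latin-1")
    return codecs.CodecInfo(name=CODEC_NAME, encode=base.encode, decode=base.decode)


def is_registered(name):
    # True if some registered search function knows this encoding name.
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


before = is_registered(CODEC_NAME)
codecs.register(_search)  # a real package would do this at import time
after = is_registered(CODEC_NAME)

print(before, after)
print(repr(b"\x90".decode(CODEC_NAME)))
```

Nothing in a later call such as `data.decode(CODEC_NAME)` shows where the
name came from, which is why an apparently unused import ends up being
load-bearing.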

I disagree with calling the WHATWG encodings that are implemented in every
Web browser "non-standard". WHATWG may not have a typical origin story as a
standards organization, but it _is_ the standards organization for the Web.

I'm really not interested in best-fit mappings that turn infinity into "8"
and square roots into "v". Making weird mappings like that sounds like a
job for the "unidecode" library, not the stdlib.

On Wed, 10 Jan 2018 at 13:36 Rob Speer <rspeer at luminoso.com> wrote:

> I'm looking at the documentation of "best fit" mappings, and that seems to
> be a different matter. It appears that best-fit mappings are designed to be
> many-to-one mappings used only for encoding.
>
> "Examples of best fit are converting fullwidth letters to their
> counterparts when converting to single byte code pages, and mapping the
> Infinity character to the number 8." (Mapping ∞ to 8? Seriously?!) It also
> does things such as mapping Cyrillic letters to Latin letters that look
> like them.
>
> This is not what I'm interested in implementing. I just want there to be
> encodings that match the WHATWG encodings exactly. If they have to be given
> a different name, that's fine with me.
>
> On Wed, 10 Jan 2018 at 03:38 M.-A. Lemburg <mal at egenix.com> wrote:
>
>> On 10.01.2018 00:56, Rob Speer wrote:
>> > Oh that's interesting. So it seems to be Python that's the exception
>> > here.
>> >
>> > Would we really be able to add entries to character mappings that
>> > haven't changed since Python 2.0?
>>
>> The Windows mappings in Python come directly from the Unicode
>> Consortium mapping files.
>>
>> If the Consortium changes the mappings, we can update them.
>>
>> If not, then we have a problem, since consumers are not only
>> the win32 APIs, but also other tools out there running on
>> completely different platforms, e.g. Java tools or web servers
>> providing downloads using the Windows code page encodings.
>>
>> Allowing such mappings in the existing codecs would then result in
>> failures when the "other" side sees the decoded Unicode version and
>> try to encode back into the original encoding - you'd move the
>> problem from the Python side to the "other" side of the
>> integration.
>>
>> I had a look on the Unicode FTP site and they have since added
>> a new directory with mapping files they call "best fit":
>>
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
>>
>> WideCharToMultiByte() defaults to best fit, but also offers
>> a standards-compliant mode:
>>
>>
>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>
>> See flag WC_NO_BEST_FIT_CHARS.
>>
>> Unicode TR#22 is also clear on this:
>>
>> https://www.unicode.org/reports/tr22/tr22-3.html#Illegal_and_Unassigned
>>
>> It allows such best fit mappings to make encodings round-trip
>> safe, but requires keeping these separate from the original
>> standard mappings:
>>
>> """
>> It is very important that systems be able to distinguish between the
>> fallback mappings and regular mappings. Systems like XML require the use
>> of hex escape sequences (NCRs) to preserve round-trip integrity; use of
>> fallback characters in that case corrupts the data.
>> """
>>
>> If you read the above section in TR#22 you quickly get reminded
>> of what the Unicode error handlers do (we basically implement
>> the three modes it mentions... raise, ignore, replace).
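
For readers who have not used them, the three built-in handlers referred to
here can be sketched as follows. The sample bytes are made up for the
illustration; 0x90 is one of the bytes Python's cp1252 leaves undefined.

```python
data = b"caf\x90"  # made-up sample; 0x90 is undefined in Python's cp1252

# "strict" is the default handler: an unmapped byte raises.
try:
    data.decode("cp1252")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "raised"

# "ignore" silently drops the unmapped byte.
ignored = data.decode("cp1252", errors="ignore")

# "replace" substitutes U+FFFD REPLACEMENT CHARACTER.
replaced = data.decode("cp1252", errors="replace")

print(strict_result, repr(ignored), repr(replaced))
```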
>>
>> Now, for unmapped sequences an error handler can opt for
>> using a fallback sequence instead.
>>
>> So in addition to adding best fit codecs, there's also the
>> option to add an error handler for best fit resolution of
>> unmapped sequences.
>>
>> Given the above, I don't think we ought to change the existing
>> standards compliant mappings, but use one of two solutions:
>>
>> a) add "best fit" encodings (see the Unicode FTP site for
>>    a list)
>>
>> b) add an error handler "bestfit" which implements the
>>    fallback modes for the encodings in question
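
Option (b) could be sketched roughly like this, using the stdlib's
codecs.register_error(). The handler name "bestfit-demo" and the two-entry
FALLBACKS table are assumptions made up for the illustration; they are not
the Unicode Consortium's best-fit data, and this is not a proposal for how
the stdlib would implement it.

```python
import codecs

# Tiny illustrative table; NOT the real "best fit" mapping files.
FALLBACKS = {
    "\uff21": "A",  # FULLWIDTH LATIN CAPITAL LETTER A
    "\u221e": "8",  # INFINITY, the example quoted from the best-fit readme
}


def bestfit_demo(exc):
    # Only encoding errors are handled; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    unmapped = exc.object[exc.start:exc.end]
    # Characters without a fallback become "?" in this sketch.
    replacement = "".join(FALLBACKS.get(ch, "?") for ch in unmapped)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


codecs.register_error("bestfit-demo", bestfit_demo)

encoded = "\uff21 = \u221e".encode("cp1252", errors="bestfit-demo")
print(encoded)
```

Because the fallback lives in the error handler, the codec's own mapping
table stays untouched and the lossy behaviour is opt-in per call.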
>>
>>
>> > On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas <
>> > python-ideas at python.org> wrote:
>> >
>> >> First of all, many thanks for such an excellently written letter. It
>> >> was a real pleasure to read.
>> >> On 10.01.2018 0:15, Rob Speer wrote:
>> >>
>> >> Hi! I joined this list because I'm interested in filling a gap in
>> >> Python's standard library, relating to text encodings.
>> >>
>> >> There is an encoding with no name of its own. It's supported by every
>> >> current web browser and standardized by WHATWG. It's so prevalent that
>> >> if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you
>> >> will get this encoding _instead_. It is probably the second or third
>> >> most common text encoding in the world. And Python doesn't quite
>> >> support it.
>> >>
>> >> You can see the character table for this encoding at:
>> >> https://encoding.spec.whatwg.org/index-windows-1252.txt
>> >>
>> >> For the sake of discussion, let's call this encoding "web-1252". WHATWG
>> >> calls it "windows-1252", but notice that it's subtly different from
>> >> Python's "windows-1252" encoding. Python's windows-1252 has bytes that
>> >> are undefined:
>> >>
>> >>     >>> b'\x90'.decode('windows-1252')
>> >>     UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in
>> >>     position 0: character maps to <undefined>
>> >>
>> >> In web-1252, the bytes that are undefined according to windows-1252
>> >> map to the control characters in those positions in iso-8859-1 -- that
>> >> is, the Unicode codepoints with the same number as the byte. In
>> >> web-1252, b'\x90' would decode as '\u0090'.
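
The decoding rule just described can be sketched in a few lines. The function
name decode_web1252 is hypothetical, and this byte-at-a-time loop is only an
illustration of the rule, not how a real codec would be written.

```python
def decode_web1252(data: bytes) -> str:
    """Decode like cp1252, but map undefined bytes to the same-numbered code point."""
    chars = []
    for value in data:
        try:
            chars.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in Python's cp1252: fall back to the C1 control
            # character whose code point equals the byte value.
            chars.append(chr(value))
    return "".join(chars)


# Bytes in 0x80-0x9F that end up as the same-numbered code point.
fallback_bytes = [
    value for value in range(0x80, 0xA0)
    if decode_web1252(bytes([value])) == chr(value)
]

print(repr(decode_web1252(b"\x90")))
print([hex(value) for value in fallback_bytes])
```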
>> >>
>> >> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does
>> >> the same:
>> >>
>> >>     "According to the information on Microsoft's and the Unicode
>> >> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused;
>> >> however, the Windows API MultiByteToWideChar
>> >> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx>
>> >> maps these to the corresponding C1 control codes
>> >> <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."
>> >> And in ISO-8859-1, the same handling is done for unused code points
>> >> even by the standard ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ) :
>> >>
>> >>     "*ISO-8859-1* is the IANA
>> >> <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority>
>> >> preferred name for this standard when supplemented with the C0 and C1
>> >> control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>
>> >> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>"
>> >> And what do you know -- these "C1 control codes" are also the
>> >> corresponding Unicode code points! (
>> >> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) )
>> >>
>> >> Since Windows is pretty much the reference implementation for
>> >> "windows-xxxx" encodings, it even makes sense to alter the existing
>> >> encodings rather than add new ones.
>> >>
>> >>
>> >> This may seem like a silly encoding that encourages doing horrible
>> >> things with text. That's pretty much the case. But there's a reason
>> >> every Web browser implements it:
>> >>
>> >> - It's compatible with windows-1252
>> >> - Any sequence of bytes can be round-tripped through it without losing
>> >> information
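
Both properties in the list above can be checked with a small sketch. TABLE,
REVERSE, decode and encode are hypothetical names for this illustration; the
table is derived from Python's own cp1252 plus the same-numbered code point
for the bytes cp1252 leaves undefined.

```python
def build_table():
    # One character per byte value: cp1252 where defined, else chr(value).
    table = []
    for value in range(256):
        try:
            table.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            table.append(chr(value))
    return table


TABLE = build_table()
REVERSE = {ch: value for value, ch in enumerate(TABLE)}


def decode(data: bytes) -> str:
    return "".join(TABLE[value] for value in data)


def encode(text: str) -> bytes:
    return bytes(REVERSE[ch] for ch in text)


all_bytes = bytes(range(256))

# Property 1: wherever Python's cp1252 defines a byte, the results agree.
compatible = all(
    decode(bytes([value])) == bytes([value]).decode("cp1252")
    for value in range(256)
    if value not in (0x81, 0x8D, 0x8F, 0x90, 0x9D)
)

# Property 2: every byte sequence survives decode followed by encode.
round_trips = encode(decode(all_bytes)) == all_bytes

print(compatible, round_trips)
```

The round trip works because the table maps the 256 byte values to 256
distinct characters, so the reverse mapping is unambiguous.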
>> >>
>> >> It's not just this one encoding. WHATWG's encoding standard (
>> >> https://encoding.spec.whatwg.org/) contains modified versions of
>> >> windows-1250 through windows-1258 and windows-874.
>> >>
>> >> Support for these encodings matters to me, in part, because I maintain
>> >> a Unicode data-cleaning library, "ftfy". One thing it does is to detect
>> >> and undo encoding/decoding errors that cause mojibake, as long as
>> >> they're detectable and reversible. Looking at real-world examples of
>> >> text that has been damaged by mojibake, it's clear that lots of text is
>> >> transferred through what I'm calling the "web-1252" encoding, in a way
>> >> that's incompatible with Python's "windows-1252".
>> >>
>> >> In order to be able to work with and fix this kind of text, ftfy
>> >> registers new codecs -- and I implemented this even before I knew that
>> >> they were standardized in Web browsers. When ftfy is imported, you can
>> >> decode text as "sloppy-windows-1252" (the name I chose for this
>> >> encoding), for example.
>> >>
>> >> ftfy can tell people a sequence of steps that they can use in the
>> >> future to fix text that's like the text they provided. Very often,
>> >> these steps require the sloppy-windows-1252 or sloppy-windows-1251
>> >> encoding, which means the steps only work with ftfy imported, even for
>> >> people who are not using the features of ftfy.
>> >>
>> >> Support for these encodings also seems highly relevant to people who
>> >> use Python for web scraping, as it would be desirable to maximize
>> >> compatibility with what a Web browser would do.
>> >>
>> >> This really seems like it belongs in the standard library instead of
>> >> being an incidental feature of my library. I know that code in the
>> >> standard library has "one foot in the grave". I _want_ these legacy
>> >> encodings to have one foot in the grave. But some of them are extremely
>> >> common, and Python code should be able to deal with them.
>> >>
>> >> Adding these encodings to Python would be straightforward to implement.
>> >> Does this require a PEP, a pull request, or further discussion?
>> >>
>> >>
>> >> _______________________________________________
>> >> Python-ideas mailing list
>> >> Python-ideas at python.org
>> >> https://mail.python.org/mailman/listinfo/python-ideas
>> >> Code of Conduct: http://python.org/psf/codeofconduct/
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Ivan
>> >>
>>
>> --
>> Marc-Andre Lemburg
>> eGenix.com
>>
>