[Python-ideas] Support WHATWG versions of legacy encodings

Random832 random832 at fastmail.com
Fri Jan 19 11:24:48 EST 2018


On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
> > Someone did discover that Microsoft's current implementations of the
> > windows-* encodings match the WHAT-WG spec, rather than the Unicode
> > spec that Microsoft originally wrote.
> 
> No, MS implements something called "best fit encodings"
> and these are different than what WHATWG uses.

NO. I made this absolutely clear in my previous message. Best fit mappings can be clearly distinguished from regular mappings by the behavior of the native conversion functions with certain argument flags. The mapping of 0xA0 to some private use character in cp932, for example, is a best-fit mapping in the decoding direction, but is treated as a regular mapping for encoding purposes. The mapping of 0x81 to U+0081 in cp1252 etc. is NOT a best fit mapping, nor is it in any way different from the rest of the mappings.

We are not talking about implementing the best fit mappings. We are talking about real regular mappings that actually exist in these codepages that were for some unknown reason not included in the files published by Unicode.
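To illustrate the kind of mapping in question, here is a minimal sketch in Python. It relies on the fact that Python's built-in cp1252 codec, which follows the table published by Unicode, leaves the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined. The error handler name "c1_fallback" and the helper decode_cp1252_with_c1 are hypothetical names made up for this example, not an existing or proposed API; the sketch only approximates the behavior under discussion by mapping each undefined byte to the code point of the same value.

```python
import codecs


def _c1_fallback(exc):
    """Hypothetical error handler: decode an undefined byte as the
    code point with the same numeric value (0x81 -> U+0081)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Resume decoding right after the offending byte(s).
    return "".join(chr(b) for b in bad), exc.end


# Register the handler under a made-up name for this illustration.
codecs.register_error("c1_fallback", _c1_fallback)


def decode_cp1252_with_c1(data: bytes) -> str:
    """Decode cp1252, filling the gaps in the Unicode-published table
    with the corresponding C1 control characters."""
    return data.decode("cp1252", errors="c1_fallback")


# The stock codec rejects 0x81 outright.
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    stock_codec_rejects_0x81 = True
else:
    stock_codec_rejects_0x81 = False

# With the fallback, 0x81 round-trips to U+0081 while defined bytes
# such as 0x80 (the euro sign) decode as usual.
decoded = decode_cp1252_with_c1(b"\x80\x81")
```

Bytes that the stock table already defines are unaffected, because the handler is only invoked for bytes the codec cannot decode.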

> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
> 
> unfortunately uses the above mentioned best fit encodings,
> but this can and should be switched off by specifying the
> WC_NO_BEST_FIT_CHARS flag for anything that requires validation
> or needs to be interoperable:

Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact does not disable the mappings we are discussing.

