[Python-ideas] Support WHATWG versions of legacy encodings

MRAB python at mrabarnett.plus.com
Thu Jan 11 16:09:22 EST 2018


On 2018-01-11 19:42, Rob Speer wrote:
>  > The question is rather: how often does web-XXX mojibake happen?
> 
> Very often. Particularly web-1252 mixed up with UTF-8.
> 
> My ftfy library is tested on data from Twitter and the Common Crawl, 
> both prime sources of mojibake. One common mojibake sequence is when a 
> right curly quote is encoded as UTF-8 and decoded as codepage 1252. In 
> Python's official windows-1252, this would at best be "�", using the 
> 'replace' error handler. In web-1252, this would be "â€\x9d". The 
> web-1252 version is more common.
> 
> Of course, since Python itself is widespread, there is some survivorship 
> bias here. Another thing you could get instead of "�" is your code 
> crashing.
> 
FWIW, I've occasionally seen that kind of mojibake on the news ticker of 
the BBC News channel. :-(
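
For reference, the curly-quote case quoted above can be reproduced with
the standard library alone. This is only a rough sketch: the error
handler name "c1_passthrough" is made up for illustration and is not the
proposed web-1252 codec or anything from ftfy. It merely approximates
the WHATWG behaviour of mapping the bytes that Python's cp1252 leaves
undefined to the C1 control characters with the same number.

```python
import codecs

# UTF-8 bytes of U+201D RIGHT DOUBLE QUOTATION MARK: E2 80 9D
raw = "\u201d".encode("utf-8")

# Python's cp1252 leaves byte 0x9D undefined, so strict decoding fails.
strict_failed = False
try:
    raw.decode("cp1252")
except UnicodeDecodeError:
    strict_failed = True

# With the 'replace' handler the undefined byte becomes U+FFFD.
replaced = raw.decode("cp1252", errors="replace")


def c1_passthrough(exc):
    """Hypothetical handler: map each undecodable byte to the code point
    with the same value, roughly what WHATWG windows-1252 does for the
    bytes that are undefined in Python's cp1252."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end


codecs.register_error("c1_passthrough", c1_passthrough)

# Approximates the web-1252 result: the 0x9D byte survives as U+009D.
web = raw.decode("cp1252", errors="c1_passthrough")

print(strict_failed)
print(ascii(replaced))
print(ascii(web))
```

Keeping the byte as a C1 control instead of replacing it means the
original bytes stay recoverable, which is what makes this sort of
mojibake repairable after the fact.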

[snip]
