[melbourne-pug] Unicode for windows dummies

Tue Aug 16 03:12:40 EDT 2016

On 16 August 2016 at 15:28, Anthony Briggs <anthony.briggs at gmail.com> wrote:

>
>
> On 16 August 2016 at 14:57, William ML Leslie <
> william.leslie.ttg at gmail.com> wrote:
>
>> On 16 August 2016 at 14:40, Anthony Briggs <anthony.briggs at gmail.com>
>> wrote:
>> > print("M├┐ h├┤v├¿r├ºr├áft ├«├ƒ f├╗┼él ├Âf ├®├¬l┼ø")
>> >
>> > works just fine for me, since you're just printing an internal Python
>> > string.
>>
>> It will work fine unless you're on Mike's machine - if
>> sys.stdout.encoding is cp850 and you've got unicode_literals imported
>> (or are using python3), it won't.
>>
>
> That string is translated to a cp1252 character set, so I'd be surprised
> if it didn't work.
>
> OTOH, try utf-8 characters in a Windows Python REPL, and you don't even
> make it to the end of the string :)
>
> print("Mÿ hôvèrçràft îß fûll öf éêls")
>

All of those characters are represented in cp1252 and can print on a
Windows terminal, but I think we're confusing two things here, so let's try
them both:

>>> s = (b'M\xc3\xbf h\xc3\xb4v\xc3\xa8r\xc3\xa7r\xc3\xa0ft \xc3\xae\xc3\x9f '
...      b'f\xc3\xbbll \xc3\xb6f \xc3\xa9\xc3\xaals')

Here, s is the text you sent in utf-8, in case my mail client gets confused.

>>> print(s.decode('utf-8'))
| Mÿ hôvèrçràft îß fûll öf éêls

This works because s.decode('utf-8') is a valid text string, and every
character in it can be encoded in cp1252.

>>> print(s.decode('cp1252'))
| MÃ¿ hÃ´vÃ¨rÃ§rÃ ft Ã®ÃŸ fÃ»ll Ã¶f Ã©Ãªls

This succeeds, as every byte in s is mapped by cp1252.  However, it
prints nonsense.
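
The nonsense is at least reversible, since no information was lost in the
wrong decode.  A minimal sketch of that round trip, rebuilding s so the
snippet stands alone (the names mojibake and repaired are illustrative
only):

```python
# The UTF-8 bytes of the test sentence, as in the example above.
s = (b'M\xc3\xbf h\xc3\xb4v\xc3\xa8r\xc3\xa7r\xc3\xa0ft \xc3\xae\xc3\x9f '
     b'f\xc3\xbbll \xc3\xb6f \xc3\xa9\xc3\xaals')

# Decoding with the wrong codec succeeds but yields mojibake.
mojibake = s.decode('cp1252')

# Encoding with the same wrong codec recovers the original bytes,
# which can then be decoded with the right codec.
repaired = mojibake.encode('cp1252').decode('utf-8')

print(repaired == s.decode('utf-8'))
```

This only works when every byte survived the wrong decode, which is the
case for this particular string.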

Mike's case is different, though.  In Python 3, reading from a file opened
in text mode gives us text, and it happens that the text (decoded with the
default encoding) is not representable in cp1252.  For example,

>>> t = u'given \u2113\u2081 = 7'

>>> print(t)

This raises UnicodeEncodeError on a machine where sys.stdout.encoding is
cp1252, because neither \u2113 nor \u2081 exists in that code page.
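
The failure can be reproduced without a Windows console by wrapping a byte
buffer in a text stream that encodes to cp1252, which is roughly what
print() writes to on such a terminal.  This is only a sketch; fake_stdout
is a stand-in name, not anything from Mike's setup:

```python
import io

t = 'given \u2113\u2081 = 7'

# A text stream that encodes to cp1252, standing in for the console.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding='cp1252')

try:
    print(t, file=fake_stdout)
    fake_stdout.flush()
    failed = False
except UnicodeEncodeError as exc:
    # The codec has no mapping for the two special characters.
    failed = True
    print('print() failed:', exc.reason)
```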

>>> print(t.encode("cp1252", "replace").decode("cp1252"))
| given ?? = 7

This replaces each unmappable character with a single question mark,
preserving the structure of the text.
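
"replace" is one of several standard error handlers for str.encode();
"backslashreplace" and "xmlcharrefreplace" keep the information instead of
discarding it.  A small sketch comparing them (the results dict is just for
illustration):

```python
t = 'given \u2113\u2081 = 7'

# Each handler decides what to emit for characters cp1252 cannot represent.
results = {}
for handler in ('replace', 'backslashreplace', 'xmlcharrefreplace'):
    results[handler] = t.encode('cp1252', handler).decode('cp1252')
    print(handler, '->', results[handler])
```

Which one is appropriate depends on whether a reader needs to recover the
original characters or just needs readable output.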

>>> print(t.encode("utf-8").decode("cp1252", "replace"))
| given â„“â‚� = 7

This gives nonsense: every UTF-8 byte of the two special characters is
shown as a separate junk character.

>
>> > The problem is from trying to print a binary string (which is what
>> > you get from .encode()) as an internal Python string. If you specify an
>> > encoding, the error goes away:
>> >
>> > print("M├┐ h├┤v├¿r├ºr├áft ├«├ƒ f├╗┼él ├Âf
>> > ├®├¬l┼ø".encode("utf-8").decode("cp1252", "replace"))
>>
>> The only reason to encode to utf-8 and then decode from cp1252 is to
>> fix incorrect input.
>>
>> I think you mean .encode("cp1252", "replace").decode("cp1252")
>>
>
> No - the point was to get a binary string that doesn't translate nicely
> into cp1252, otherwise you don't need the 'replace' parameter. This is
> Mike's core problem - he's reading bytes from a utf-8 file, and trying to
> print that to the terminal.
>

First things first: Mike isn't reading bytes.  open() in Python 3.5 gives
text by default, but the text he gets is not representable in cp850.
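
A rough sketch of that situation, assuming the file is UTF-8 as discussed
earlier in the thread.  The temporary file stands in for Mike's file, and
printable_for is a hypothetical helper, not something from his code:

```python
import os
import tempfile

# Write a small UTF-8 file to read back (a stand-in for the real file).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write('given \u2113\u2081 = 7\n'.encode('utf-8'))

# Text mode with an explicit encoding: we get str, not bytes.
with open(path, encoding='utf-8') as f:
    text = f.read()
os.remove(path)

def printable_for(text, encoding):
    """Return text with characters the target encoding lacks replaced by '?'."""
    return text.encode(encoding, 'replace').decode(encoding)

print(type(text).__name__)
print(printable_for(text, 'cp850'), end='')
```

In real use one would probably pass sys.stdout.encoding rather than a
hard-coded 'cp850'.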

Nearly all bytestrings "translate nicely" into cp1252 when you
.decode("cp1252") them (Python's cp1252 codec leaves only five byte values
undefined), where by nicely I presume you mean not raising an exception,
not actually making sense.
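
Whether a given codec can decode every byte is easy to check directly.  As
far as I know, CPython's latin-1 and cp850 codecs accept all 256 byte
values, while its cp1252 codec leaves a handful undefined.  A quick sketch
(undecodable is an illustrative name):

```python
all_bytes = [bytes([b]) for b in range(256)]

def undecodable(codec):
    """Return the byte values that the codec refuses to decode."""
    bad = []
    for b in all_bytes:
        try:
            b.decode(codec)
        except UnicodeDecodeError:
            bad.append(b[0])
    return bad

for codec in ('latin-1', 'cp850', 'cp1252'):
    print(codec, [hex(b) for b in undecodable(codec)])
```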

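.decode("cp1252") can fail only on the five byte values that Python's
cp1252 codec leaves undefined (0x81, 0x8d, 0x8f, 0x90 and 0x9d), so the
"replace" is very nearly redundant, and whatever decodes without error can
/never/ fail to encode back to cp1252.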

-- 
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered under
copyright law.  You absolutely MAY reproduce any part of it in accordance
with the copyright law of the nation you are reading this in.  Any attempt
to DENY YOU THOSE RIGHTS would be illegal without prior contractual
agreement.