Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

anatoly techtonik techtonik at gmail.com
Fri May 29 04:05:07 EDT 2015


On Wed, May 27, 2015 at 3:57 PM, Laura Creighton <lac at openend.se> wrote:
> ------- Forwarded Message
>
> Return-Path: <python-list-bounces+lac=openend.se at python.org>
> Received: from mail.python.org (mail.python.org [82.94.164.166])
>         by theraft.openend.se (8.14.4/8.14.4/Debian-4) with ESMTP id t4RC09ap02
> From: Chris Angelico <rosuav at gmail.com>
> Cc: "python-list at python.org" <python-list at python.org>
>
>
> On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik <techtonik at gmail.com> wrote:
>> And the short answer is that we need unicode because we are printing this
>> information to the stdout, and stdout is opened in text mode at least on
>> Windows, and without explicit conversion, Python will try to decode stuff
>> as being `ascii` and fail anyway.
>
> So you're working with text.

No. It is unknown.

I am printing the Nodes of an SCons build graph, and I don't know how Nodes
are represented. In my case it turned out that a Node contained Russian text,
which crashed SCons. It could be Russian text in cp1251, in utf-8 or in
KOI-8, and I can't guess every possible encoding there. I just need to
print that tree without a crash or information loss.

> That means you HAVE to decode it somehow;
> you fundamentally cannot print bytes to the console. Lossless
> concealment of arbitrary bytes won't help you.

Won't help me with what? I am debugging build scripts to find out the
*structure* of my dependencies, and then all of a sudden Python crashes
with a UnicodeDecodeError, leaving me cursing aloud in bad Russian.

Python is not only less forgiving than Java here, it is also more
treacherous, because the error only shows up at run time.

It would surely help preserve my zen if Python could just flow through
the nodes of this graph. Garbage is okay - I can clean it up or remove it if
it stands in the way. Just don't disrupt my flow or tell me that now I want
to deal with UnicodeDecode errors. Because I don't.

> If you can't adequately
> decode everything, either backslash-escape the rest, or use a
> replacement character; you can't print out those bytes.

Yes. How do I backslash-escape the rest in Python 2? In Python 3 there is
some freaky "surrogateescape" error strategy, but what do I do in
Python 2?

A replacement character is not a solution, because it loses data:
if I want to post-process the graph log, I won't be able to
recover the missing bits.
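
One possible approach for Python 2, as a minimal sketch rather than a
definitive recipe: register a custom error handler with
codecs.register_error() and decode as ASCII, so that every byte that fails
to decode comes out as a \xNN escape. The names backslash_bytes and
to_printable are made up for illustration, and the input is assumed to be a
byte string (str in Python 2). Literal backslashes are doubled first so that
the output stays unambiguous.

```python
import codecs


def backslash_bytes(exc):
    """Decode error handler: turn each undecodable byte into \\xNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # bytearray yields integers on both Python 2 and Python 3
    escaped = u"".join(u"\\x%02x" % b for b in bytearray(bad))
    # Resume decoding right after the offending bytes
    return escaped, exc.end


# "backslash_bytes" is a hypothetical handler name, not a built-in one
codecs.register_error("backslash_bytes", backslash_bytes)


def to_printable(raw):
    """Convert a byte string of unknown encoding to ASCII-only text."""
    # Double literal backslashes so they cannot be confused with escapes
    raw = raw.replace(b"\\", b"\\\\")
    return raw.decode("ascii", "backslash_bytes")
```

For example, to_printable(b"abc\xd0\xbf") gives the text abc\xd0\xbf with
the two bytes spelled out as escapes. The result contains only ASCII, so it
can be printed to any console, and the original bytes can be parsed back
out of the log later.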

> And no, I will not cc you. Subscribe to the list if you're going to
> ask a question.

Added Mailman to my suxx tracker:
https://github.com/techtonik/suxx-tracker#mailman

-- 
anatoly t.


