Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

Fri May 29 04:19:14 EDT 2015

On Fri, May 29, 2015 at 6:05 PM, anatoly techtonik <techtonik at gmail.com> wrote:
>> On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik <techtonik at gmail.com> wrote:
>>> And the short answer is that we need unicode because we are printing this
>>> information to the stdout, and stdout is opened in text mode at least on
>>> Windows, and without explicit conversion, Python will try to decode stuff
>>> as being `ascii` and fail anyway.
>>
>> So you're working with text.
>
> No. It is unknown.
>
> I am printing Nodes of SCons build graph and I don't know how Nodes are
> represented. In my case it appeared that Node contained Russian text, which
> led to crash of SCons. It could contain Russian text in cp1251 or in utf-8 or in
> KOI-8 and I can't do guessing of all possible encodings there. I just need to
> print that tree without crash or information loss.

So it is text; you just don't know the encoding. You're trying to
display bytes as if they were text, which means that, fundamentally,
you're working with text.

>> That means you HAVE to decode it somehow;
>> you fundamentally cannot print bytes to the console. Lossless
>> concealment of arbitrary bytes won't help you.
>
> Won't help me with what? I am debugging build scripts to find out the
> *structure* of my dependencies and then all of the sudden Python crashes
> with UnicodeDecode error leaving me pronouncing bad Russian curses
> aloud.

Your fundamental problem is not the UnicodeDecodeError, but the
unknown encoding. What you're seeing is that Python refuses to be
sloppy: it won't guess an encoding for you.
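
To make that concrete, here is a minimal sketch (the sample word and
the variable names are my own illustration, not anything taken from
SCons). It encodes one Russian word as cp1251 and then decodes the
same bytes under three different encodings:

```python
# Hypothetical sample: the Russian word "Privet" (hello), written with
# escapes so the source file itself stays pure ASCII.
original = u"\u041f\u0440\u0438\u0432\u0435\u0442"
raw = original.encode("cp1251")

# The right codec recovers the text.
as_cp1251 = raw.decode("cp1251")

# KOI8-R assigns a character to every byte value, so decoding "works"
# but silently produces different text.
as_koi8 = raw.decode("koi8_r")

# UTF-8 has validity rules, so it rejects these bytes outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_cp1251 == original)  # True
print(as_koi8 == original)    # False
print(utf8_ok)                # False
```

Only the UTF-8 attempt raises. The KOI8-R attempt is the worse outcome:
no error, just wrong text.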

>> If you can't adequately
>> decode everything, either backslash-escape the rest, or use a
>> replacement character; you can't print out those bytes.
>
> Yes. How to backslash the rest in Python 2? In Python 3 there is
> some freaky "surrogateescape" error strategy, but what to do in
> Python 2?

Not sure what's so freaky about it. But hey. If Python 2 can't do what
you want, is it so hard to use Python 3? Unicode support really is
better. Alternatively, just do something like this:

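# Python 2: a plain "..." literal is a byte string (str), not unicode.
b = "some arbitrary byte string that you got from somewhere"
try:
    # Well-formed UTF-8 decodes straight to unicode.
    text = b.decode("utf-8")
except UnicodeDecodeError:
    # Otherwise fall back to the repr, which is pure ASCII (\xNN escapes).
    text = repr(b).decode("ascii")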

The repr of a byte string in Py2 should be a safe way to display
arbitrary bytes, without data loss. It will expand the string
significantly (four characters for each byte that needs a \xNN escape,
plus the surrounding quotes and a backslash in front of anything else
that needs one), but the result is pure ASCII, so it can always be
printed.
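
If you would rather escape only the bytes that fail to decode, instead
of the whole string, one possibility is a custom error handler
registered through codecs.register_error, which exists in both Python
2 and Python 3. This is a rough sketch, not a tested recipe for SCons;
the handler name "backslash_bytes" and both function names are made up
for illustration:

```python
import codecs


def backslash_undecodable(exc):
    """Replace each undecodable byte with a literal \\xNN escape."""
    if isinstance(exc, UnicodeDecodeError):
        # bytearray yields integers on both Python 2 and Python 3.
        bad = bytearray(exc.object[exc.start:exc.end])
        escaped = u"".join(u"\\x%02x" % byte for byte in bad)
        # Return the replacement text and the position to resume at.
        return (escaped, exc.end)
    # This handler is only meant for decoding errors.
    raise exc


# "backslash_bytes" is an arbitrary name chosen for this sketch.
codecs.register_error("backslash_bytes", backslash_undecodable)


def to_display_text(raw):
    """Decode as UTF-8, escaping whatever is not valid UTF-8."""
    return raw.decode("utf-8", "backslash_bytes")


print(to_display_text(b"abc\xff"))
```

Valid UTF-8 comes through untouched and only the offending bytes are
expanded. One caveat: a literal backslash already present in the input
is left alone, so the output is not strictly reversible unless you
escape backslashes as well.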

> Replacement character is not a solution, because it is a data loss,
> and if I want to do post processing of graph log, I won't be able to
> recover the missing bits.
>
>> And no, I will not cc you. Subscribe to the list if you're going to
>> ask a question.
>
> Added Mailman to my suxx tracker:
> https://github.com/techtonik/suxx-tracker#mailman

Why? You're trying to fire questions out to a community without being
a part of that community. Why is that the software's problem?

You can either subscribe to the list/newsgroup or follow via some web
interface, but it's unreasonable to ask everyone to cc you. Imagine if
we _did_ all cc you, but we also cc you in on an entire sub-thread
that you're not interested in. Or maybe half of us do and half don't.
What then? You don't get any sort of control over what you get copies
of. Is that really what you want?

ChrisA