Fwd: Lossless bulletproof conversion to unicode (backslashing)

anatoly techtonik techtonik at gmail.com
Wed May 27 07:15:06 EDT 2015


Hi.

This was labelled offtopic in python-ideas, so I edited and forwarded
it here. Please CC as I am not subscribed.


In short: what I need is a bulletproof way to convert from anything
to unicode. This requires some kind of escaping to go forward and
back, with helper functions like u2b() (unicode to binary) and b2u()
(binary to unicode) that also handle the escaping and unescaping. So
far I can't find any code that does just that.


Background story. I need to print the SCons graph. SCons is a build
tool, so it has a graph of nodes - what depends on what. I have no
idea what a node object could be. I know only that it can have a
human-readable representation. Sometimes a node is a filename in some
encoding that is not utf-8, and without knowing the encoding,
converting it to unicode is not possible without losing the
information about that filename.

So, here is what Python proposes:

https://docs.python.org/2.7/library/functions.html?highlight=unicode#unicode

The unicode() type constructor doesn't allow you to do the conversion
without losing data. It offers only two basic outcomes, crash or
corrupt, through three error modes:

1. ignore  - meaning skip and corrupt the data
2. replace  - just corrupt the data
3. strict - just crash
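
To make the three modes concrete, here is a small sketch. It uses
bytes.decode(), which takes the same errors argument as unicode(),
and is written for Python 3. The sample filename is made up for
illustration.

```python
# Hypothetical sample: a Latin-1 encoded filename, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# "ignore" silently drops the byte that cannot be decoded.
ignored = raw.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD, so the original byte value is gone.
replaced = raw.decode("utf-8", "replace")

# "strict" (the default) raises instead of returning anything.
try:
    raw.decode("utf-8", "strict")
    crashed = False
except UnicodeDecodeError:
    crashed = True

# Neither lossy result can be turned back into the original bytes.
roundtrip_ok = (ignored.encode("utf-8") == raw,
                replaced.encode("utf-8") == raw)
```

In all three cases the original byte 0xE9 cannot be recovered from
the result.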

Python design leaves the decision how to implement safe
interoperability to you, and that's basically the reason why Python 3
fails. Without a safe approach (a way to get my binary data back from
that unicode) people just can't wrap their heads around it.

Python design assumes that people know the encoding of data they
are processing, but that's not true in many cases. The data may also
be just broken or invalid. So, the real world coding assumptions are:

1. external data encoding is unknown or varies
2. external data has binary chunks that are invalid for
conversion to unicode

In the real world, crashing with UnicodeDecodeError is not an option
for dealing with unknown, broken, or invalid input (such as when I
need to print a human-readable representation of a Node to the
screen). In many (most?) situations lossless garbage is more welcome
than a crash or data loss, and that should be the default behaviour.


The solution is to have a filter preprocess the binary string to
escape all non-unicode symbols so that the following lossless
transformation becomes possible:

   binary -> escaped utf-8 string -> unicode -> binary

I want to know if that's feasible. I need to accomplish that with
Python 2.x, but the use case is probably valid for Python 3 as well.
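
A minimal sketch of what such helpers might look like, written for
Python 3. The assumptions here are mine: the data is treated as
UTF-8, undecodable bytes become \xNN, literal backslashes are doubled
so the escaping stays reversible, and the handler name
"b2u_backslash" is made up. It is one possible scheme, not an
existing library API.

```python
import codecs
import re


def _escape_invalid(exc):
    """Decode error handler: turn each undecodable byte into \\xNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    return (u"".join(u"\\x%02x" % b for b in bad), exc.end)


# Hypothetical handler name, registered once at import time.
codecs.register_error("b2u_backslash", _escape_invalid)

# Matches the two escapes b2u() can produce: "\\" and "\xNN".
_ESCAPE = re.compile(u"\\\\(\\\\|x[0-9a-f]{2})")


def b2u(data):
    """Bytes -> text without losing information."""
    # Double literal backslashes first so they cannot be mistaken for
    # escapes. Byte 0x5C never occurs inside a multi-byte UTF-8 sequence.
    doubled = data.replace(b"\\", b"\\\\")
    return doubled.decode("utf-8", "b2u_backslash")


def u2b(text):
    """Text produced by b2u() -> the original bytes."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        out += text[pos:match.start()].encode("utf-8")
        token = match.group(1)
        if token == u"\\":
            out.append(0x5C)                 # doubled backslash
        else:
            out.append(int(token[1:], 16))   # \xNN -> raw byte
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


raw = b"caf\xe9.txt"          # hypothetical non-UTF-8 filename
assert u2b(b2u(raw)) == raw   # lossless round trip
```

The result of b2u() is ordinary text that is safe to print, and
u2b() restores the exact input bytes, including the ones that were
not valid UTF-8.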

This stuff is critical for porting SCons to Python 3.x, and I expect
the same for other similar tools that have to deal with unknown
ascii-binary strings too.
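
For the Python 3 side specifically, the built-in "surrogateescape"
error handler (PEP 383) already gives a lossless round trip, although
it smuggles the bad bytes as lone surrogates rather than as readable
backslash escapes. A short sketch, with a made-up sample filename:

```python
# Python 3 only: "surrogateescape" maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN and back again on encode.
raw = b"caf\xe9.txt"   # hypothetical non-UTF-8 filename

text = raw.decode("utf-8", "surrogateescape")
back = text.encode("utf-8", "surrogateescape")

# The text cannot be encoded strictly, so for display the lone
# surrogates still have to be escaped somehow.
printable = text.encode("ascii", "backslashreplace").decode("ascii")
```

Here back is equal to raw, but text itself still needs an extra
escaping step before it can be written to a terminal.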

-- 
anatoly t.


