Fwd: Lossless bulletproof conversion to unicode (backslashing)

Wed May 27 08:47:46 EDT 2015

On Wed, 27 May 2015 09:15 pm, anatoly techtonik wrote:

> Hi.
> 
> This was labelled offtopic in python-ideas, so I edited and forwarded
> it here. Please CC as I am not subscribed.
> 
> 
> In short, I need a bulletproof way to convert from anything to
> unicode. This requires some kind of escaping to go forward and back.

Why do you need to go back? Just keep the node, and use that.

> Some helper function like u2b() (unicode to binary) and b2u() (that
> also removes escaping). So far I can't find any code that does just
> that.

def bytes2unicode(data):
    # Convert bytes to Unicode, allowing garbage (mojibake).
    return data.decode('latin1')

def unicode2bytes(text):
    # Convert Unicode containing garbage (mojibake) back to bytes.
    return text.encode('latin1')

It correctly does the round trip from any sequence of bytes to unicode and
back to bytes, losslessly:

py> import random
py> node = bytes([random.randrange(0, 256) for _ in range(100000)])
py> uni = bytes2unicode(node)
py> b = unicode2bytes(uni)
py> b == node
True
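
The reason this cannot fail is that Latin-1 assigns byte value N to code
point N for every N from 0 to 255, so there is no byte it rejects. A minimal
check of that property (the variable names are purely illustrative):

```python
# Latin-1 maps each byte value N (0..255) to the code point with the same
# number, so decoding never fails and encoding restores the original bytes.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin1')

same_numbers = [ord(ch) for ch in as_text] == list(range(256))
round_trips = as_text.encode('latin1') == all_bytes
print(same_numbers, round_trips)
```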

But take careful note that you can't start with arbitrary Unicode and still
expect to round-trip losslessly. Many perfectly readable Unicode strings
contain characters outside Latin-1 and do *not* convert to bytes this way:

py> unicode2bytes(u'ДЙ')  # two Cyrillic letters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in unicode2bytes
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1:
ordinal not in range(256)

That means that if you take correctly encoded bytes in some other encoding,
they will round-trip, but the intermediate string will display as garbage:

py> s = u'ДЙ'
py> node = s.encode('utf-8')
py> print(node)  # Correctly encoded UTF-8
b'\xd0\x94\xd0\x99'
py> node == unicode2bytes(bytes2unicode(node))  # round trips okay
True
py> print(repr(bytes2unicode(node)))  # but prints as crap
'Ð\x94Ð\x99'
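
If the real encoding turns up later, the garbage string is still
recoverable, because the original bytes are intact inside it. A small
sketch, assuming the encoding turns out to be UTF-8 (recover() is a
hypothetical helper, not something from SCons or the standard library):

```python
def recover(garbage, real_encoding):
    # Undo the latin1 decode to get the original bytes back,
    # then decode them with the encoding they were really in.
    return garbage.encode('latin1').decode(real_encoding)

node = u'ДЙ'.encode('utf-8')
garbage = node.decode('latin1')   # same result as bytes2unicode(node)
fixed = recover(garbage, 'utf-8')
print(fixed == u'ДЙ')
```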

> Background story. I need to print SCons graph. SCons is a build tool,
> so it has a graph of nodes - what depends on what. I have no idea
> what a node object could be. I know only that it can have human
> readable representation. Sometimes node is a filename in some
> encoding that is not utf-8, and without knowing the encoding,
> converting it to unicode is not possible without losing the information
> about that filename.

py> filename = "My Russian ДЙ name"  # Unicode
py> b = filename.encode('koi8-r')  # Oops, not UTF-8!
py> b.decode("utf-8")  # Fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 11:
invalid continuation byte
py> b.decode("utf-8", errors="replace")  # lossy, but works
'My Russian �� name'
py> s = b.decode("utf-8", errors="surrogateescape")  # magic!
py> s
'My Russian \udce4\udcea name'

It round-trips as well:

py> s.encode("utf-8", errors="surrogateescape") == b
True
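
Wrapped up as the two helpers asked for, this might look like the following.
It is only a sketch for Python 3: the names b2u() and u2b() come from the
question, the UTF-8 default is an assumption, and printable() is a
hypothetical extra for display, since a string holding lone surrogates can
raise UnicodeEncodeError when printed to a UTF-8 terminal.

```python
def b2u(data, encoding='utf-8'):
    # Bytes that are invalid in `encoding` become lone surrogates
    # U+DC80..U+DCFF instead of raising UnicodeDecodeError.
    return data.decode(encoding, errors='surrogateescape')

def u2b(text, encoding='utf-8'):
    # Escaped surrogates are turned back into the original bytes.
    return text.encode(encoding, errors='surrogateescape')

def printable(text):
    # Hypothetical display helper: show anything non-ASCII as backslash
    # escapes. For display only; use u2b() to get the bytes back.
    return text.encode('ascii', errors='backslashreplace').decode('ascii')

raw = 'My Russian ДЙ name'.encode('koi8-r')
s = b2u(raw)
print(u2b(s) == raw)
print(printable(s))
```

Correctly encoded UTF-8 passes through b2u() as ordinary readable text; only
the bytes that fail to decode are escaped.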

Converting this back to Python 2.7, which has no surrogateescape error
handler, is left as an exercise for the reader.

-- 
Steven