Fwd: Lossless bulletproof conversion to unicode (backslashing)
Steven D'Aprano
steve at pearwood.info
Wed May 27 08:47:46 EDT 2015
On Wed, 27 May 2015 09:15 pm, anatoly techtonik wrote:
> Hi.
>
> This was labelled offtopic in python-ideas, so I edited and forwarded
> it here. Please CC as I am not subscribed.
>
>
> In short, I need a bulletproof way to convert from anything to
> unicode. This requires some kind of escaping to go forward and back.
Why do you need to go back? Just keep the node, and use that.
> Some helper function like u2b() (unicode to binary) and b2u() (that
> also removes escaping). So far I can't find any code that does just
> that.
def bytes2unicode(bytes):
    # Convert bytes to Unicode, allowing garbage (mojibake).
    return bytes.decode('latin1')

def unicode2bytes(unicode):
    # Convert Unicode containing garbage (mojibake) to bytes.
    return unicode.encode('latin1')

It correctly does the round trip from any sequence of bytes to unicode and
back to bytes, losslessly:
py> import random
py> node = bytes([random.randrange(0, 256) for _ in range(100000)])
py> uni = bytes2unicode(node)
py> b = unicode2bytes(uni)
py> b == node
True
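
The random test above can also be made exhaustive. A minimal sketch, assuming Python 3 (the parameter names data and text are mine, chosen to avoid shadowing the builtins): Latin-1 maps each byte value 0-255 to the code point with the same number, so checking all 256 values once covers every possible input.

```python
def bytes2unicode(data):
    # Latin-1 maps byte value N to code point N, so this never fails.
    return data.decode('latin1')

def unicode2bytes(text):
    # Only works if every character has a code point below 256.
    return text.encode('latin1')

# Every possible byte value, exactly once.
all_bytes = bytes(range(256))
text = bytes2unicode(all_bytes)

# Each byte became the code point with the same numeric value...
assert [ord(c) for c in text] == list(range(256))
# ...and the conversion back recovers the original bytes.
assert unicode2bytes(text) == all_bytes
```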
But take careful note that you can't start with Unicode and still expect to
round-trip losslessly. Many perfectly readable Unicode strings do *not*
convert to bytes:
py> unicode2bytes(u'ДЙ') # two Cyrillic letters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in unicode2bytes
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1:
ordinal not in range(256)
It also means that if you start with bytes that are correctly encoded in
some other encoding, such as UTF-8, they will round-trip, but the
intermediate string will display as garbage:
py> s = u'ДЙ'
py> node = s.encode('utf-8')
py> print(node) # Correctly encoded UTF-8
b'\xd0\x94\xd0\x99'
py> node == unicode2bytes(bytes2unicode(node)) # round trips okay
True
py> print(repr(bytes2unicode(node))) # but prints as crap
'Ð\x94Ð\x99'
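
The garbage is not lost information, though. As a small sketch (assuming Python 3, and assuming you later learn that the real encoding was UTF-8), the Latin-1 string can be turned back into bytes and decoded properly. The variable names here are just for illustration.

```python
original = 'ДЙ'

# Correctly encoded UTF-8 bytes.
raw = original.encode('utf-8')

# Decoding as Latin-1 never fails, but yields mojibake.
garbled = raw.decode('latin1')

# Undo the Latin-1 step to get the bytes back, then decode them
# with the encoding they were really in.
recovered = garbled.encode('latin1').decode('utf-8')

print(recovered == original)
```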
> Background story. I need to print SCons graph. SCons is a build tool,
> so it has a graph of nodes - what depends on what. I have no idea
> what a node object could be. I know only that it can have human
> readable representation. Sometimes node is a filename in some
> encoding that is not utf-8, and without knowing the encoding,
> converting it to unicode is not possible without losing the information
> about that filename.
In Python 3, the surrogateescape error handler deals with this case:
py> filename = "My Russian ДЙ name" # Unicode
py> b = filename.encode('koi8-r') # Oops, not UTF-8!
py> b.decode("utf-8") # Fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 11:
invalid continuation byte
py> b.decode("utf-8", errors="replace") # lossy, but works
'My Russian �� name'
py> s = b.decode("utf-8", errors="surrogateescape") # magic!
py> s
'My Russian \udce4\udcea name'
It round-trips as well:
py> s.encode("utf-8", errors="surrogateescape") == b
True
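
Putting that together, here is one possible sketch of the b2u() and u2b() helpers asked for, assuming Python 3 and assuming UTF-8 is the encoding you hope for. The function printable() is a hypothetical extra of mine: strings containing escaped bytes cannot always be written to a terminal, so it uses the backslashreplace error handler to produce an ASCII-only form for display.

```python
def b2u(data):
    # Valid UTF-8 decodes normally; each undecodable byte 0xNN becomes
    # the lone surrogate U+DCNN instead of raising an error.
    return data.decode('utf-8', errors='surrogateescape')

def u2b(text):
    # Reverse of b2u: surrogates U+DC80..U+DCFF turn back into raw bytes.
    return text.encode('utf-8', errors='surrogateescape')

def printable(text):
    # Hypothetical display helper: anything non-ASCII, including the
    # escaped bytes, is shown as a backslash escape.
    return text.encode('ascii', errors='backslashreplace').decode('ascii')

raw = 'My Russian ДЙ name'.encode('koi8-r')   # not UTF-8
s = b2u(raw)
assert u2b(s) == raw                          # lossless round trip
print(printable(s))                           # safe to print anywhere
```

Note that printable() is for display only; go back to bytes with u2b() on the original string, not on the escaped form.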
Converting this back to Python 2.7 is left as an exercise for the reader.
--
Steven