converting to and from octal escaped UTF-8

Michael Goerz answer654 at 8439.e4ward.com
Mon Dec 3 09:10:50 EST 2007


MonkeeSage wrote:
> On Dec 3, 1:31 am, MonkeeSage <MonkeeS... at gmail.com> wrote:
>> On Dec 2, 11:46 pm, Michael Spencer <m... at telcopartners.com> wrote:
>>
>>
>>
>>> Michael Goerz wrote:
>>>> Hi,
>>>> I am writing unicode strings into a special text file that requires
>>>> non-ascii characters to be written as octal-escaped UTF-8 codes.
>>>> For example, the letter "Í" (latin capital I with acute, code point 205)
>>>> would come out as "\303\215".
>>>> I will also have to read back from the file later on and convert the
>>>> escaped characters back into a unicode string.
>>>> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
>>>> vice versa?
>>> Perhaps something along the lines of:
>>>   >>> def encode(source):
>>>   ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
>>>   ...
>>>   >>> def decode(encoded):
>>>   ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
>>>   ...     return bytes.decode('utf8')
>>>   ...
>>>   >>> encode(u"Í")
>>>   '\\303\\215'
>>>   >>> print decode(_)
>>>   Í
>>> HTH
>>> Michael
>> Nice one. :) If I might suggest a slight variation to handle cases
>> where the "encoded" string contains plain text as well as octal
>> escapes...
>>
>> def decode(encoded):
>>   for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
>>     encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>>   return encoded.decode('utf8')
>>
>> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
>> as well as "adf\\303\\215adf".
>>
>> Regards,
>> Jordan
> 
> err...
> 
> def decode(encoded):
>   for octc in re.findall(r'\\(\d{3})', encoded):
>     encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>   return encoded.decode('utf8')
Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ASCII and non-printable characters,
and it takes and returns unicode strings. Low ASCII values are now also
encoded as 3-digit octal sequences, so that decode can catch them
properly.

Thanks a lot,
Michael

____________

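# -*- coding: utf-8 -*-
import re

def encode(source):
    # Escape control characters, DEL and all non-ASCII characters as the
    # 3-digit octal values of their UTF-8 bytes; keep printable ASCII as is.
    encoded = u""
    for character in source:
        if (ord(character) < 32) or (ord(character) > 126):
            for byte in character.encode('utf8'):
                encoded += u"\\%03o" % ord(byte)
        else:
            encoded += character
    return encoded

def decode(encoded):
    # Work on a byte string so each escape can be replaced by a single byte.
    decoded = encoded.encode('utf-8')
    # Match octal digits only, limited to values that fit in one byte.
    for octc in re.findall(r'\\([0-3][0-7]{2})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')


orig = u"blaÍblub" + chr(10)
enc  = encode(orig)
dec  = decode(enc)
print orig
print enc
print dec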
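
The script above is written for Python 2. For anyone reading this with
Python 3, here is a rough sketch of the same round trip. It is not part
of the original solution. It assumes that escapes are always exactly
three octal digits and that DEL (127) counts as non-printable; the name
_OCTAL_ESCAPE is made up for this example.

```python
import re

# Assumption: an escape is a backslash plus exactly three octal digits
# whose value fits in one byte (000 to 377).
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")

def encode(source):
    # Escape control characters, DEL and non-ASCII characters as the
    # octal values of their UTF-8 bytes; keep printable ASCII unchanged.
    pieces = []
    for character in source:
        if ord(character) < 32 or ord(character) > 126:
            # Iterating over bytes yields ints in Python 3, so no ord().
            pieces.extend("\\%03o" % byte for byte in character.encode("utf-8"))
        else:
            pieces.append(character)
    return "".join(pieces)

def decode(encoded):
    # Replace every escape with its byte in one pass, then decode UTF-8.
    raw = encoded.encode("utf-8")
    raw = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")

orig = "blaÍblub\n"
enc = encode(orig)
dec = decode(enc)
print(enc)
print(dec == orig)
```

One limitation of the scheme itself: a backslash is printable ASCII, so
encode leaves it alone, and text that already contains a backslash
followed by three octal digits will not survive the round trip.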



