UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

M.-A. Lemburg mal@lemburg.com
Tue, 16 Nov 1999 11:42:19 +0100


Tim Peters wrote:
> 
> [MAL, on raw Unicode strings]
> > ...
> > Agreed... note that you could also write your own codec for just this
> > reason and then use:
> >
> > u = unicode('....\u1234...\...\...','raw-unicode-escaped')
> >
> > Put that into a function called 'ur' and you have:
> >
> > u = ur('...\u4545...\...\...')
> >
> > which is not that far away from ur'...' w/r to cosmetics.
> 
> Well, not quite.  In general you need to pass raw strings:
> 
> u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
>             ^
> u = ur(r'...\u4545...\...\...')
>        ^
> 
> else Python will replace all the other backslash sequences.  This is a
> crucial distinction at times; e.g., else \b in a Unicode regexp will expand
> into a backspace character before the regexp processor ever sees it (\b is
> supposed to be a word boundary assertion).

Right.
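
To make the \b point concrete, here is a small sketch in current Python 3 syntax (not the 1999 dialect used in the rest of this message). The names cooked and raw are only for illustration:

```python
import re

cooked = '\bfoo'    # Python turns \b into a backspace (U+0008) first
raw = r'\bfoo'      # the backslash and the 'b' both reach the regexp engine

print(len(cooked), len(raw))                    # 4 5
print(re.search(raw, 'a foo') is not None)      # True: \b is a word boundary
print(re.search(cooked, 'a foo') is not None)   # False: looks for a backspace
```

The second search fails because the pattern now contains a literal backspace character, which the subject string does not.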

Here is a sample implementation of what I had in mind:

""" Demo for 'unicode-escape' encoding.
"""
import struct,string,re

pack_format = '>H'

def convert_string(s):

    # Pack every character of s as one 16-bit value (see pack_format).
    l = map(None,s)
    for i in range(len(l)):
        l[i] = struct.pack(pack_format,ord(l[i]))
    return l

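# Accept 0-4 hex digits here so that short \uXXXX sequences still match
# and can be reported as errors by unicode_unescape().
u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')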

def unicode_unescape(s):

    l = []
    start = 0
    while start < len(s):
        m = u_escape.search(s,start)
        if not m:
            # No more escapes: convert the rest of the string and stop.
            l[len(l):] = convert_string(s[start:])
            break
        m_start,m_end = m.span()
        if m_start > start:
            # Convert the plain text in front of the escape.
            l[len(l):] = convert_string(s[start:m_start])
        hexcode = m.group(1)
        if len(hexcode) != 4:
            raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
        ordinal = string.atoi(hexcode,16)
        l.append(struct.pack(pack_format,ordinal))
        start = m_end
    return string.join(l,'')
    
def hexstr(s,sep=''):

    # Two hex digits per character of s, joined with sep.
    return string.join(map(lambda x,ord=ord: '%02x' % ord(x),s),sep)
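
For readers who want to try the demo today, the following is a rough Python 3 port, offered as a sketch rather than as the original code. It assumes the input contains only characters below U+10000 (struct.pack with '>H' cannot hold anything larger), and the ur() helper at the end is a hypothetical wrapper named after the function discussed earlier in the thread:

```python
import re
import struct

PACK_FORMAT = '>H'  # big-endian unsigned 16-bit, as in the demo above

# Up to four hex digits are accepted so short sequences can be rejected
# explicitly instead of being passed through unnoticed.
U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def convert_string(s):
    """Pack each character of s as one 16-bit big-endian code unit."""
    return [struct.pack(PACK_FORMAT, ord(ch)) for ch in s]

def unicode_unescape(s):
    """Return s as 16-bit big-endian bytes, expanding \\uXXXX escapes."""
    chunks = []
    start = 0
    while start < len(s):
        m = U_ESCAPE.search(s, start)
        if not m:
            chunks.extend(convert_string(s[start:]))
            break
        m_start, m_end = m.span()
        if m_start > start:
            chunks.extend(convert_string(s[start:m_start]))
        hexcode = m.group(1)
        if len(hexcode) != 4:
            raise SyntaxError('illegal \\uXXXX sequence: \\u%s' % hexcode)
        chunks.append(struct.pack(PACK_FORMAT, int(hexcode, 16)))
        start = m_end
    return b''.join(chunks)

def hexstr(data, sep=''):
    """Two hex digits per byte of data, joined with sep."""
    return sep.join('%02x' % byte for byte in data)

def ur(s):
    """Hypothetical helper: decode the packed bytes into a str."""
    return unicode_unescape(s).decode('utf-16-be')

print(hexstr(unicode_unescape(r'a\u1234b'), ' '))   # 00 61 12 34 00 62
```

Note that the argument is still passed as a raw string, for the reason Tim gives above: other backslash sequences such as \b are left alone and come out as a backslash followed by a letter.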

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/