UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
M.-A. Lemburg
mal@lemburg.com
Tue, 16 Nov 1999 11:42:19 +0100
Tim Peters wrote:
>
> [MAL, on raw Unicode strings]
> > ...
> > Agreed... note that you could also write your own codec for just this
> > reason and then use:
> >
> > u = unicode('....\u1234...\...\...','raw-unicode-escaped')
> >
> > Put that into a function called 'ur' and you have:
> >
> > u = ur('...\u4545...\...\...')
> >
> > which is not that far away from ur'...' w/r to cosmetics.
>
> Well, not quite. In general you need to pass raw strings:
>
> u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
>             ^
> u = ur(r'...\u4545...\...\...')
>        ^
>
> else Python will replace all the other backslash sequences. This is a
> crucial distinction at times; e.g., else \b in a Unicode regexp will expand
> into a backspace character before the regexp processor ever sees it (\b is
> supposed to be a word boundary assertion).
Right.
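Tim's point about raw strings can be checked directly. The following is a minimal sketch in present-day Python 3, not code from the original thread; the names `plain` and `raw` are illustrative only. It shows that a non-raw literal turns \b into a backspace before the regexp engine sees it, while a raw literal passes the backslash through unchanged.

```python
import re

# Non-raw literal: the compiler replaces each \b with a backspace (0x08).
plain = '\bfoo\b'
# Raw literal: the backslashes survive, so the regex engine sees \b.
raw = r'\bfoo\b'

print(len(plain), repr(plain[0]))   # 5 '\x08'
print(len(raw), repr(raw[:2]))      # 7 '\\b'

# Only the raw pattern works as a word-boundary assertion.
print(re.search(raw, 'a foo b') is not None)    # True
print(re.search(plain, 'a foo b') is not None)  # False
```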
Here is a sample implementation of what I had in mind:
""" Demo for 'unicode-escape' encoding.
"""
import struct,string,re

pack_format = '>H'

def convert_string(s):
    l = map(None,s)
    for i in range(len(l)):
        l[i] = struct.pack(pack_format,ord(l[i]))
    return l

u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def unicode_unescape(s):
    l = []
    start = 0
    while start < len(s):
        m = u_escape.search(s,start)
        if not m:
            l[len(l):] = convert_string(s[start:])
            break
        m_start,m_end = m.span()
        if m_start > start:
            l[len(l):] = convert_string(s[start:m_start])
        hexcode = m.group(1)
        #print hexcode,start,m_start
        if len(hexcode) != 4:
            raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
        ordinal = string.atoi(hexcode,16)
        l.append(struct.pack(pack_format,ordinal))
        start = m_end
    #print l
    return string.join(l,'')

def hexstr(s,sep=''):
    return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep)
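For readers who want to run the demo on a current interpreter, here is a rough Python 3 port. It is a sketch under stated assumptions, not part of the original post: the result is a bytes object holding big-endian 16-bit code units, the upper-case constant names are my own, and characters above U+FFFF are not handled (struct.pack would raise struct.error for them).

```python
import re
import struct

PACK_FORMAT = '>H'   # big-endian unsigned 16-bit, as in the demo above
U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def convert_string(s):
    # One 2-byte chunk per character; assumes every ordinal fits in 16 bits.
    return [struct.pack(PACK_FORMAT, ord(ch)) for ch in s]

def unicode_unescape(s):
    chunks = []
    start = 0
    while start < len(s):
        m = U_ESCAPE.search(s, start)
        if not m:
            # No further \u escapes: convert the remainder literally.
            chunks.extend(convert_string(s[start:]))
            break
        m_start, m_end = m.span()
        if m_start > start:
            chunks.extend(convert_string(s[start:m_start]))
        hexcode = m.group(1)
        if len(hexcode) != 4:
            raise SyntaxError('illegal \\uXXXX sequence: \\u%s' % hexcode)
        chunks.append(struct.pack(PACK_FORMAT, int(hexcode, 16)))
        start = m_end
    return b''.join(chunks)

def hexstr(data, sep=''):
    # Iterating over bytes yields integers in Python 3, so no ord() is needed.
    return sep.join('%02x' % byte for byte in data)

# The raw literal keeps \b as two characters; only \u1234 is interpreted.
print(hexstr(unicode_unescape(r'abc\u1234\b'), ' '))
# 00 61 00 62 00 63 12 34 00 5c 00 62
```

Only the \uXXXX sequence is decoded; every other backslash sequence passes through as ordinary characters, which is the behaviour wanted for regexps.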
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 45 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/