UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
Tim Peters
tim_one@email.msn.com
Tue, 16 Nov 1999 00:38:40 -0500
[MAL, on raw Unicode strings]
> ...
> Agreed... note that you could also write your own codec for just this
> reason and then use:
>
> u = unicode('....\u1234...\...\...','raw-unicode-escaped')
>
> Put that into a function called 'ur' and you have:
>
> u = ur('...\u4545...\...\...')
>
> which is not that far away from ur'...' w/r to cosmetics.
Well, not quite. In general you need to pass raw strings:
    u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
                ^
    u = ur(r'...\u4545...\...\...')
           ^
Otherwise Python will replace all the other backslash sequences before the
codec ever runs. This is a crucial distinction at times; e.g., without the
raw string, \b in a Unicode regexp will expand into a backspace character
before the regexp processor ever sees it (\b is supposed to be a word
boundary assertion).
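
To make the difference concrete, here is a minimal sketch in present-day Python 3, which postdates this thread. It assumes the standard 'raw_unicode_escape' codec as a stand-in for the 'raw-unicode-escaped' name used above, and the ur() helper is only a hypothetical illustration of MAL's suggestion, not an existing function.

```python
import re

# Non-raw literal: the compiler turns \b into a backspace (U+0008)
# before the re module ever sees the pattern.
cooked = '\bcat\b'
# Raw literal: the backslash and the 'b' both survive, so the re module
# sees \b and treats it as a word boundary assertion.
raw = r'\bcat\b'

text = 'a cat sat'
cooked_match = re.search(cooked, text)  # looks for backspace characters
raw_match = re.search(raw, text)        # looks for the word "cat"


def ur(s):
    """Hypothetical helper: expand only \\uXXXX escapes, leave the rest.

    Assumes s is an ASCII-only string that was written as a raw literal.
    """
    return s.encode('ascii').decode('raw_unicode_escape')


# \u1234 is expanded by the codec; \b is passed through untouched.
pattern = ur(r'\u1234\b')
```

Here cooked_match is None while raw_match finds "cat", and pattern ends with a literal backslash followed by "b", which is what a regexp processor needs to see.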