UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

Tim Peters tim_one@email.msn.com
Tue, 16 Nov 1999 00:38:40 -0500


[MAL, on raw Unicode strings]
> ...
> Agreed... note that you could also write your own codec for just this
> reason and then use:
>
> u = unicode('....\u1234...\...\...','raw-unicode-escaped')
>
> Put that into a function called 'ur' and you have:
>
> u = ur('...\u4545...\...\...')
>
> which is not that far away from ur'...' w/r to cosmetics.

Well, not quite.  In general you need to pass raw strings:

u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
            ^
u = ur(r'...\u4545...\...\...')
       ^

else Python will expand all the other backslash escape sequences itself,
before the codec ever sees them.  This is a crucial distinction at times;
e.g., without the r prefix, \b in a Unicode regexp turns into a backspace
character before the regexp processor ever sees it (\b is supposed to be a
word-boundary assertion).
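
A minimal sketch of the difference, for illustration only.  It assumes a
present-day Python 3 rather than the unicode() builtin under discussion, and
it swaps MAL's 'raw-unicode-escaped' codec name for the built-in
raw_unicode_escape codec.  The ur() helper is hypothetical, and encoding
through Latin-1 is an assumption that limits its input to Latin-1 text:

```python
import re


def ur(s):
    # Hypothetical helper: decode only \uXXXX / \UXXXXXXXX escapes and leave
    # every other backslash alone.  Assumes s holds only Latin-1 characters.
    return s.encode("latin-1").decode("raw_unicode_escape")


text = "a word here"

# Raw literal: the backslashes reach ur(), and then the regexp engine, intact.
raw_pattern = ur(r"\bword\b")

# Plain literal: Python has already turned each \b into a backspace (U+0008)
# by the time ur() is called, so the word-boundary assertion is lost.
cooked_pattern = ur("\bword\b")

raw_match = re.search(raw_pattern, text)        # matches "word"
cooked_match = re.search(cooked_pattern, text)  # None: no backspaces in text

# \u escapes are still decoded by the codec when passed in raw.
decoded = ur(r"\u4545")
```

Here raw_pattern is still the eight characters \bword\b, while cooked_pattern
is six characters beginning and ending with a backspace, so only the raw form
finds the word.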