Unicode/UTF-8 confusion

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Sat Mar 15 12:57:19 EDT 2008


On Sat, 15 Mar 2008 12:09:19 -0400, Tom Stambaugh wrote:

> I'm still confused about this, even after days of hacking at it. It's time I 
> asked for help. I understand that each of you knows more about Python, 
> Javascript, unicode, and programming than me, and I understand that each of 
> you has a higher SAT score than me. So please try and be gentle with your 
> responses.
> 
> I use simplejson to serialize html strings that the server is delivering to 
> a browser. Since the apostrophe is a string terminator in javascript, I need 
> to escape any apostrophe embedded in the html.
> 
> Just to be clear, the specific unicode character I'm struggling with is 
> described in Python as:
> u'\N{APOSTROPHE}'. It has a standardized utf-8 value (according to, for 
> example, http://www.fileformat.info/info/unicode/char/0027/index.htm) of 
> 0x27.
> 
> This can be expressed in several common ways:
> hex: 0x27
> Python literal: u"\u0027"
> 
> Suppose I start with some test string that contains an embedded 
> apostrophe -- for example: u"   '   ". I believe that the appropriate json 
> serialization of this is (presented as a list to eliminate notation 
> ambiguities):
> 
> ['"', ' ', ' ', ' ', '\\', '\\', '0', '0', '2', '7', ' ', ' ', ' ', '"']
>
> This is a 14-character utf-8 serialization of the above test string.
> 
> I know I can brute-force this, using something like the following:
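> import simplejson
> 
> def encode(aRawString):
>     # Build the 5-character replacement (backslash, 0, 0, 2, 7) piecewise.
>     aReplacement = ''.join(['\\', '0', '0', '2', '7'])
>     aCookedString = aRawString.replace("'", aReplacement)
>     # dumps() escapes the backslash again, so the output has two of them.
>     answer = simplejson.dumps(aCookedString)
>     return answer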
> 
> I can't even make mailers let me *TYPE* a string literal for the replacement 
> string without trying to turn it into an HTML link!
> 
> Anyway, I know that my "encode" function works, but it pains me to add that 
> "replace" call before *EVERY* invocation of the simplejson.dumps() method. 
> The reason I upgraded to 1.7.4 was to get the c-level speedup routine now 
> offered by simplejson -- yet the need to do this apostrophe escaping seems 
> to negate this advantage! Is there perhaps some combination of dumps keyword 
> arguments, python encode()/str() magic, or something similar that 
> accomplishes this same result?
> 
> What is the highest-performance way to get simplejson to emit the desired 
> serialization of the given test string?

Somehow I don't get what you are after.  The ' doesn't have to be escaped
at all if " is used to delimit the string.  If ' is used as the delimiter,
then \' is a correct escape.  What is the problem with that?
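
To make that concrete, here is a minimal sketch.  It assumes the standard
library json module behaves like simplejson on this point, and the helper
for_single_quoted_js is a hypothetical name made up for illustration, not
part of either library:

```python
import json  # assumption: stdlib json matches simplejson for this case

text = u"   '   "

# dumps() delimits strings with double quotes, so the apostrophe
# passes through without any escaping.
encoded = json.dumps(text)
print(encoded)

# Decoding gives back the original string unchanged.
assert json.loads(encoded) == text


def for_single_quoted_js(json_text):
    """Hypothetical helper: escape JSON text for use inside a
    single-quoted JavaScript string literal."""
    # Escape backslashes first, then apostrophes.
    return json_text.replace("\\", "\\\\").replace("'", "\\'")


print("var s = '" + for_single_quoted_js(encoded) + "';")
```

The second step is only needed when the JSON text itself is pasted between
' delimiters in the page; it runs once on the finished output, so it does
not have to wrap every dumps() call.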

Ciao,
	Marc 'BlackJack' Rintsch


