Any way to designate the encoding of the "source" for compile?

John Machin sjmachin at lexicon.net
Mon May 16 20:33:35 EDT 2005


On 16 May 2005 16:44:30 -0700, janeaustine50 at hotmail.com wrote:

>
>janeaustin... at hotmail.com wrote:
>> John Machin wrote:
>> > On 16 May 2005 10:15:22 -0700, janeaustine50 at hotmail.com wrote:
>> >
>> > >janeaustine50 at hotmail.com wrote:
>> > >> Python's InteractiveInterpreter uses the built-in compile function.
>> > >>
>> > >> According to the ref. manual, it doesn't seem to concern itself with
>> > >> the encoding of the source string.
>> > >>
>> > >> When I hand in a unicode object, it is encoded in utf-8 automatically.
>> > >> It can be a problem when I'm building an interactive environment using
>> > >> "compile", with a different encoding from utf-8.
>> >

==== This is *EXACTLY* what your problem is ====
>> > I don't understand this. Suppose your "different encoding" is cp125x
>> > (where x is a digit). Would you not do something like this?
>> >
>> > compile_input = user_input.decode('cp125x')
>> > code_object = compile(compile_input, ......
=================================================




==== It would have helped had you followed this ==========
>> > and when it comes to Unicode
>> > objects (indeed any text), show us repr(text) -- "what you see is
>> > often not what others see and often not what you've actually got".
===========================================================
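
To illustrate why repr() matters here, the following is a minimal sketch in
Python 3 syntax (an assumption; the session quoted in this thread is Python 2).
It reuses the four bytes from the thread. The variable names are illustrative
only. The same bytes yield different text depending on the codec, and only the
escaped form shows what is actually held:

```python
# The four bytes quoted in this thread, held as raw bytes.
raw = b'\xc7\xd1\xb1\xdb'

# Decoded with the codec the bytes were written in: two Hangul characters.
text = raw.decode('euc-kr')

# Decoded with the wrong codec: four unrelated Latin-1 characters.
wrong = raw.decode('latin-1')

# repr()/ascii() show the actual content regardless of how a terminal,
# mail client, or browser would render it.
print(repr(raw))
print(ascii(text))
print(ascii(wrong))
```

Printing the strings directly would depend on the display's encoding, which is
exactly the ambiguity the escaped forms avoid.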

>> Okay, I'll use one of the CJK codecs as the example. EUC-KR is the
>> default encoding.
>>
>> >>> import sys;sys.getdefaultencoding()
>> 'euc-kr'
>> >>> '한글'
# There's a very strong assumption that the above was originally
encoded in euc-kr but by the time I copied the 2 chars out of my
browser it was definitely Unicode. See what I mean about using repr()?

>> '\xc7\xd1\xb1\xdb'
>> >>> u'한글'
>> u'\ud55c\uae00'
>> >>> s=compile("inside=u'한글'",'','single')
>> >>> exec s
>> >>> inside #wrong

[big snip]

Like I said, *ALL* you have to do (like in any other Unicode-aware
app) is decode your user input into Unicode (you *don't* need to parse
bits and pieces of it) and feed it in ... like this:

>>> user_input_kr = "inside=u'\xc7\xd1\xb1\xdb'"
>>> user_input_uc = user_input_kr.decode('euc-kr')
>>> user_input_uc
u"inside=u'\ud55c\uae00'"
>>> s = compile(user_input_uc, '', 'single')
>>> exec s
>>> inside
u'\ud55c\uae00'
>>> # right
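
For readers on Python 3, where exec is a function and str is already Unicode,
the same decode-then-compile approach might look like the sketch below. The
bytes literal stands in for raw EUC-KR user input, and the `namespace` dict and
the '<input>' filename are illustrative choices that do not appear in the
session above:

```python
# Raw user input as EUC-KR encoded bytes (stand-in for terminal input).
user_input_kr = b"inside = '\xc7\xd1\xb1\xdb'"

# Decode the whole input once; no need to parse pieces of it.
user_input_uc = user_input_kr.decode('euc-kr')

# compile() receives text, so no encoding guesswork is left to it.
code_object = compile(user_input_uc, '<input>', 'single')

# Run the code in an explicit namespace so the result can be inspected.
namespace = {}
exec(code_object, namespace)

# ascii() shows the code points that were actually stored.
print(ascii(namespace['inside']))
```

An explicit namespace is used only so that the assigned name can be checked
without touching the module's globals.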

HTH,
John



