Changing the default text codec

Fuzzyman michael at foord.net
Mon Feb 23 10:21:29 EST 2004


Paul Prescod <paul at prescod.net> wrote in message news:<mailman.193.1077530419.27104.python-list at python.org>...
> Fuzzyman wrote:
> > Sorry if my terminology is wrong..... but I'm having intermittent
> > problems dealing with accented characters in python. (Only from the 8
> > bit latin-1 character set I think..)
> 
> I would say that if you get a 100% failure rate in IDLE and a 100% 
> success rate from a console program then your problem is not 
> intermittent but environment specific.

If that was the case then I'm sure you'd be right... good not to
quibble about terminology eh ;-)

(in a few other test cases the success-fail pattern was the opposite
way round)


> 
> > For example - if I run my program from IDLE and give it the word
> > 'degri' (containing e-acute) then I get the error :
> 
> What do you mean by "give it the word"? Through raw_input()? Through a file?
> 

Right - it is fetching the words from a Tkinter entry box using the
get() method.

> However you are getting this information, it seems to me that in IDLE 
> you are getting a Unicode object rather than an 8-bit string object. 
> Convert it to an 8-bit string:
> 
> mydata.encode("latin-1")

Great - that might do the job.
I'll try it.
Thanks.
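
A minimal sketch of the suggested conversion, written for Python 3, where
str corresponds to the Unicode object discussed here and bytes to the 8-bit
string. The sample word and the variable names are assumptions for
illustration, not taken from the original program.

```python
# Hypothetical value, standing in for what the entry box's get() returns.
text = "d\u00e9gri"  # contains e-acute (U+00E9)

# Encode explicitly to an 8-bit latin-1 string instead of relying on
# whatever the default codec happens to be.
data = text.encode("latin-1")

# Decoding with the same codec recovers the original text.
round_trip = data.decode("latin-1")
```

In latin-1 every character maps to exactly one byte, so the encoded string
has the same length as the text it came from.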

> 
>  >  if letter in self.valid_letters:
>  > UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
>  > 26: ordinal not in range(128)
> 
> Something looks suspicious here. I wouldn't expect self.valid_letters to 
> have a 0x83 character in it because I would expect it to be hard-coded 
> to ASCII in your program like:
> 

self.valid_letters is in fact string.lowercase, which I thought
included the 8-bit latin-1 letters as well. (The letters are converted
to lowercase with the .lower() string method.)


> valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."
> 
> On the other hand I wouldn't expect "letter" to have more than one 
> character so how could it have a problem at position 26?
> 

I'm iterating over the string.
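
One possible reading of the traceback, sketched in Python 3: if
valid_letters is an 8-bit string holding the 26 ASCII letters followed by
locale-specific letters, then decoding it as ASCII fails at the first
non-ASCII byte, which sits at position 26. The byte string below is a
hypothetical stand-in for such a locale-dependent string.lowercase, not the
value from the original program, and the decode is done explicitly here
because Python 3 does not mix the two string types implicitly.

```python
import string

# Hypothetical stand-in: a-z followed by one non-ASCII byte (0x83).
valid_letters = string.ascii_lowercase.encode("ascii") + b"\x83"

error = None
try:
    # Roughly the conversion that happens when an 8-bit string is
    # compared against a Unicode object under an ASCII default codec.
    valid_letters.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# The exception records which byte failed and where.
failed_position = error.start
failed_byte = error.object[error.start]
```

Under this reading the position in the message refers to valid_letters
rather than to the single letter being tested.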



> > What I'd like to do is switch by default to an 8 bit codec (latin-1 I
> > think ?????) and then offer the user the choice of either mapping the
> > accented characters to their nearest equivalent (e-acute to e for
> > example) *or* treating them as separate characters.
> 
> Why change the default codec rather than explicitly using the codec you 
> care about? If you want to work in the 8-bit world rather than the 
> Unicode world, just use the "encode" method on the Unicode object. If 
> you want to work in the Unicode world, decode the 8-bit strings instead.
> 


Great - sounds good.
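
For the option of mapping accented characters to their nearest equivalent,
one approach is to decompose each character and drop the combining marks.
This is a sketch in Python 3 using the standard unicodedata module; the
function name strip_accents and the sample word are assumptions made for
illustration.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accents removed (e-acute becomes e)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_accents("d\u00e9gri")
```

Characters that have no decomposition pass through unchanged, so treating
accented letters as separate characters simply means skipping this step.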

> > I can't work out how to change the default codec (no matter what the
> > locale) ?
> 
> I'd advise against fixing the problem in that way. Convert data 
> appropriately when you bring it from the outside world into the Python 
> program and ignore the default codec.
> 
>   Paul Prescod

Thanks for your help.



Fuzzyman

http://www.voidspace.org.uk/atlantibots/pythonutils.html


