Problems With Accented Characters

Fuzzyman michael at foord.net
Mon Feb 23 03:45:52 EST 2004


michael at foord.net (Fuzzyman) wrote in message news:<8089854e.0402221547.2f6cf5f7 at posting.google.com>...
> I've written an anagram finder that produces anagrams from a
> dictionary of words. The user can load their own dictionary.
> 
> ( http://www.voidspace.org.uk/atlantibots/nanagram.html )
> 
> In order to ensure it is able to find anagrams properly I wanted to
> strip characters like punctuation etc from words in the dictionary and
> words the user entered. I test(ed) against the 26 English letters (
> string.ascii_lowercase ).
> 
> I now have someone who wants to use a French dictionary - with words
> containing accented characters !! I have two choices - either map the
> accented characters to their unaccented equivalent (slightly
> inaccurate) or treat the accented characters as a separate letter (very
> few anagrams). However - at the moment I can't experiment with either
> because my default codec is the 7-bit ascii and crashes (sometimes !!)
> when using the accented characters.
> 


It's particularly difficult for me to understand what is happening -
because python's behaviour *seems* intermittent.

For example - if I run my program from IDLE and give it the word
'degré' (containing e-acute) then I get the error :

Exception in Tkinter callback
Traceback (most recent call last):
[snip..]
  File "D:\Python Projects\Nanagram1.3\Nanagram-GUI.pyw", line 123, in prepare
    if letter in self.valid_letters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 26: ordinal not in range(128)
Traceback (most recent call last):

It is testing each character of the user's input to remove invalid
characters (like "-" and "'")...  It crashes when it comes to the
e-acute.
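
One possible reading of the traceback, sketched below in modern Python 3:
"position 26" would fit a byte string made of the 26 ASCII letters followed
by a non-ASCII byte. In the Python 2 of the time, testing a unicode letter
against a byte string decoded the byte string implicitly with the default
ascii codec. The contents of valid_letters here are an assumption, not taken
from the program; the sketch only does that decode explicitly.

```python
import string

# Hypothetical stand-in for self.valid_letters: the 26 ASCII letters
# followed by one non-ASCII byte (0x83, the byte named in the traceback).
valid_letters_bytes = string.ascii_lowercase.encode("ascii") + b"\x83"

error_message = None
failing_position = None
try:
    # Python 2 did this decode implicitly when a unicode string was
    # compared with a byte string; here it is done explicitly.
    valid_letters_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    failing_position = exc.start

print(error_message)

# Decoding with an explicit 8-bit codec never fails, because every byte
# value maps to some character. Whether latin-1 is the *right* codec for
# the data is a separate question.
valid_letters_text = valid_letters_bytes.decode("latin-1")
print(len(valid_letters_text))
```

If that reading is right, it would also explain why the failure looks
intermittent: the error only appears when one side of the comparison is
unicode and the other is a byte string with non-ASCII bytes in it.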


*However* - if I run it by double clicking on the file then it appears
to work fine (e.g. if I ask it to find anagrams of 'degré hello ma' then
it strips out the e-acute, thinking it's an invalid character, and
finds anagrams of the rest) :

gleam	holder
hallo	merged

What I'd like to do is switch by default to an 8 bit codec (latin-1 I
think ?) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e for
example) *or* treating them as separate characters.


Anyone able to help ??



Fuzzy



> Has anyone any advice - or can point me to any resources - for
> effectively handling these characters. I guess it's a latin-1 encoding
> I want to use... I can't even work out how to change the default
> codec...
> 
> Thanks,
> 
> Fuzzy
> 
> http://www.voidspace.org.uk/atlantibots/pythonutils.html
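
On the default codec question quoted above: rather than changing the
default, one common approach is to name the codec wherever bytes enter the
program, for example when the dictionary file is read. A minimal sketch,
assuming modern Python 3 and a latin-1 word list with one word per line;
load_dictionary is a hypothetical helper, and the temporary file only
stands in for a user's French dictionary.

```python
import os
import tempfile


def load_dictionary(path, encoding="latin-1"):
    """Read one word per line, decoding the file with an explicit codec."""
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with a throwaway latin-1 encoded file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("degré\nété\n".encode("latin-1"))

words = load_dictionary(tmp.name)
os.remove(tmp.name)
print(words)
```

The encoding argument could be offered to the user alongside the
dictionary choice, since the program cannot reliably guess it.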


