[Tutor] Unicode trouble

Øyvind python at kapitalisten.no
Wed Nov 30 19:14:14 CET 2005


Øyvind wrote:
>>Where are you getting these errors (what line of the program)? Do you
>>know what kind of strings objSelection.Find.Execute() is expecting?
>>
>>Kent
>
>> The program stops working and gives me these errors when I try to run it
>> when it encounters a non-english letter.
>
>> This is the full error:
>> Traceback (most recent call last):
>>   File
>> "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
>> line 310, in RunScript
>>     exec codeObject in __main__.__dict__
>>   File "C:\Python\BA\Oversett.py", line 47, in ?
>>   File "C:\Python\BA\Oversett.py", line 23, in kjor
>>     en = i.split('\t')[0]
>>   File "C:\Python23\lib\codecs.py", line 388, in readlines
>>     return self.reader.readlines(sizehint)
>>   File "C:\Python23\lib\codecs.py", line 314, in readlines
>>     return self.decode(data, self.errors)[0].splitlines(1)
>> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170:
>> invalid data

>This is fairly strange as the line
>  en = i.split('\t')[0]
>should not call any method in codecs. I don't know how you can get such a
>stack trace.

The file f that en comes from contains lots of lines, each with one English
word followed by a tab and a Norwegian one (approximately 25000 lines). A
line can look like this: core\tkjærne
So en is supposed to be the English word that the program needs to find in
MS Word, and to is the replacement word. So wouldn't that be a string that
should be handled by codecs?

        for i in self.f.readlines():
            en = i.split('\t')[0]

>Maybe try deleting all the .pyc files to make sure they are in sync with
>the source and try again?

This didn't seem to help.

>The actual error indicates that the input data is not valid utf-8. Are you
>sure that is the correct encoding for the input file? If the file is utf-8
>and has bad characters you could pass errors='ignore' or errors='replace'
>as a parameter to codecs.open() to change the error handling style to
>something more forgiving.

Is it not valid utf-8? I have tried with latin-1 as well, to no avail. The
letters that are the problem are æøå. They shouldn't be that exotic?
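
One way to check which encoding the file really uses is to decode a sample
of its raw bytes by hand. The sketch below assumes the word list was saved
in a single-byte encoding such as Latin-1, where æøå are the bytes 0xE6,
0xF8 and 0xE5. That is an assumption; the thread does not say how the file
was written.

```python
raw = b'kj\xe6rne \xf8l bl\xe5'  # 'kjærne øl blå' as Latin-1 bytes (assumed)

# Strict utf-8 decoding rejects these bytes.
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Latin-1 maps every byte to a character, so this always succeeds.
as_latin1 = raw.decode('latin-1')

# errors='replace' hides the problem by substituting U+FFFD.
lenient = raw.decode('utf-8', 'replace')

# Re-encoding the decoded text gives the bytes a real utf-8 file would hold.
as_utf8_bytes = as_latin1.encode('utf-8')
```

If a sample from the real file behaves like raw here, the file is not utf-8,
and errors='replace' would silently damage the Norwegian words instead of
fixing them.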

>> Traceback (most recent call last):
>>   File
>> "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
>> line 310, in RunScript
>>     exec codeObject in __main__.__dict__
>>   File "C:\Python\BA\Oversett.py", line 49, in ?
>>   File "C:\Python\BA\Oversett.py", line 33, in kjor
>>     if t % 1000 == 0:
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 17:
>> ordinal not in range(128)

>Again this stack trace doesn't make sense, the indicated line doesn't do
>any string operation.

>This error message normally occurs when a non-ascii string is converted to
>unicode using the default encoding (which is 'ascii'). Often the
>conversion is implicit in some other operation but I don't see any such
>operation here.

But regardless, shouldn't 'ascii' be excluded here? Since I tell the
program to change to utf-8, not only once but twice?
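
For what it is worth, byte 0xe5 is å in Latin-1, and opening a file as utf-8
does not change the default encoding used when a byte string is converted to
unicode elsewhere in the program. Here is a small sketch that triggers the
same message explicitly and then avoids it by naming the encoding. The
sample word and the Latin-1 encoding are assumptions for illustration.

```python
word = b'bl\xe5'  # byte string holding 'blå' as Latin-1 (assumed)

# Decoding with the 'ascii' codec fails on any byte above 0x7f.
try:
    word.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Naming the actual encoding produces a proper unicode string.
decoded = word.decode('latin-1')
```

Decoding every byte string explicitly at the point where it enters the
program keeps the implicit 'ascii' conversion from being triggered later.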

>> objSelection.Find.Execute() is supposed to accept any kind of string. (It
>> is the function Search & Replace in MS Word).

>It has to make some assumption about the type of the string. Does it want
>unicode or encoded bytes? If encoded bytes, what encoding does it expect?

I think the letters should be accepted. The Python script here is written
to replace about 25000 MS Word macros, so all the letters have been
accepted by MS Word when fed by Visual Basic. All I have done now is to
extract the words from the macros and put them in a file.






