[Tutor] Unicode trouble

Wed Nov 30 20:46:30 CET 2005

On Wed, 30 Nov 2005 13:41:54 -0500
Kent Johnson <kent37 at tds.net> wrote:

> >>>This is the full error:
> >>>Traceback (most recent call last):
> >>>  File
> >>>"C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py",
> >>>line 310, in RunScript
> >>>    exec codeObject in __main__.__dict__
> >>>  File "C:\Python\BA\Oversett.py", line 47, in ?
> >>>  File "C:\Python\BA\Oversett.py", line 23, in kjor
> >>>    en = i.split('\t')[0]
> >>>  File "C:\Python23\lib\codecs.py", line 388, in readlines
> >>>    return self.reader.readlines(sizehint)
> >>>  File "C:\Python23\lib\codecs.py", line 314, in readlines
> >>>    return self.decode(data, self.errors)[0].splitlines(1)
> >>>UnicodeDecodeError: 'utf8' codec can't decode bytes in position 168-170:
> >>>invalid data
> > 
> > 
> >>This is fairly strange as the line
> >> en = i.split('\t')[0]
> >>should not call any method in codecs. I don't know how you can get such a
> >>stack trace.
> > 
> > The file f that en comes from contains lots of lines, each with one English
> > word followed by a tab and a Norwegian one (approximately 25000 lines). A
> > line can look like this: core\tkjærne
> 
> Yes, I understand that.
> 
> > So en is supposed to be the English word that the program needs to find in
> > MS Word, and to is the replacement word. So wouldn't that be a string that
> > should be handled by codecs?
> > 
> >         for i in self.f.readlines():
> >             en = i.split('\t')[0]
> 
> The thing is, it's the line
>   for i in self.f.readlines():
> that is calling the codecs module, not the line
>   en = i.split('\t')[0]
> but it is the latter line that is in the stack trace.
> 
> Can any of the other tutors make any sense of this stack trace?

As far as I can see, isn't it this line in codecs.py

    return self.decode(data, self.errors)[0].splitlines(1)

that actually raises the exception?

I haven't read the whole thread, but maybe the file contains bytes that are
not valid UTF-8, and you are passing them to the utf8 codec?
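
To illustrate what I mean, here is a minimal sketch. It assumes, without my
having seen the file, that the word list was saved as Latin-1 rather than
UTF-8. The sample line and the variable names are made up for the example,
and it is written so that it also runs on a current Python, not only 2.3.

```python
# Hypothetical sample line, modelled on the "core\tkjaerne" example above.
line = u"core\tkj\u00e6rne"

# Assumption: the file on disk is Latin-1, so the ae ligature is the single byte 0xE6.
raw = line.encode("latin-1")

# Decoding those bytes as UTF-8 fails, because 0xE6 announces a three-byte
# sequence and the bytes that follow are not valid continuation bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the data was really written in works,
# and the split on the tab then gives the two words.
decoded = raw.decode("latin-1")
en, to = decoded.split(u"\t")
```

If that is what is happening, opening the file with the encoding it was
really written in, for example codecs.open(filename, 'r', 'latin-1'), should
let readlines() get through it. Which encoding that is, I can only guess.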

Michael