Need debugging knowhow for my creeping Unicodephobia

kj no.email at please.post
Wed Feb 10 16:03:01 EST 2010


In <402ac982-0750-4977-adb2-602b19149d81 at m24g2000prn.googlegroups.com> Jonathan Gardner <jgardner at jonathangardner.net> writes:

>On Feb 10, 11:09 am, kj <no.em... at please.post> wrote:
>> FWIW, I'm using Python 2.6.  The example above happens to come from
>> a script that extracts data from HTML files, which are all in
>> English, but they are a daily occurrence when I write code to
>> process non-English text.  The script uses Beautiful Soup.  I won't
>> post a lot of code because, as I said, what I'm after is not so
>> much a way around this specific error as much as the tools and
>> techniques to troubleshoot it and fix it on my own.  But to ground
>> the problem a bit I'll say that the exception above happens during
>> the execution of a statement of the form:
>>
>>   x = '%s %s' % (y, z)
>>
>> Also, I found that, with the exact same values y and z as above,
>> all of the following statements work perfectly fine:
>>
>>   x = '%s' % y
>>   x = '%s' % z
>>   print y
>>   print z
>>   print y, z
>>

>What are y and z?

  x = "%s %s" % (table['id'], table.tr.renderContents())

where the variable table represents a BeautifulSoup.Tag instance.

>Are they unicode or strings?

The first item (table['id']) is unicode, and the second is str.
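
That mix is what sets the stage for the error: in Python 2, formatting a
unicode and a str together produces a unicode result, so the str is first
decoded implicitly with the default codec, which is ascii.  A minimal
sketch of doing that decode explicitly instead follows.  The values are
made-up stand-ins, the helper name combine is hypothetical, and the choice
of UTF-8 is an assumption (nothing above confirms what encoding the str
actually holds).

```python
# -*- coding: utf-8 -*-

def combine(y, z, encoding="utf-8"):
    """Format y and z together, decoding z explicitly if it is bytes.

    The default encoding is an assumption; pass whatever encoding the
    byte string really uses.
    """
    if isinstance(z, bytes):  # in Python 2.6+, bytes is an alias for str
        z = z.decode(encoding)
    return u"%s %s" % (y, z)


# Hypothetical stand-ins for the unicode and str values discussed above.
y = u"mainTable"
z = b"caf\xc3\xa9"  # UTF-8 bytes containing one non-ASCII character

x = combine(y, z)
print(repr(x))
```

The point of the explicit decode is that the encoding becomes a visible
choice in the code, rather than the interpreter silently falling back to
ascii.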

>What are their values?

The only easy way I know to examine the values of these strings is
to print them, which, I know, is very crude.  (IOW, to answer this
question usefully, in the context of this problem, more Unicode
knowhow is needed than I have.)  If I print them, the output for
the first one on my screen is "mainTable", and for the second it is

<th class="mainTableHeader" colspan="2">  Tags</th>
<th class="mainTableHeader">  Id</th>


>It sounds like someone, probably beautiful soup, is trying to turn
>your strings into unicode. A full stacktrace would be useful to see
>who did what where.

Unfortunately, there's not much in the stacktrace: 

Traceback (most recent call last):
  File "./download_tt.py", line 427, in <module>
    x = "%s %s" % (table['id'], table.tr.renderContents())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 41: ordinal not in range(128)

(NB: the difference between this error message and the one I
originally posted, namely the position of the unrecognized byte,
is because I simplified the code for the purpose of posting it
here, eliminating one additional processing of the second entry of
the tuple above.)
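
One way to get more out of an error like the one in the traceback above is
to reproduce the decode by hand and use the attributes of the exception
(start and end) to show the bytes around the failure.  This is only a
sketch: the helper name find_bad_bytes is hypothetical, and the sample
bytes are made up, with a leading newline and a UTF-8 no-break space
("\xc2\xa0") assumed so that the bad byte lands at the same position as in
the traceback.

```python
# -*- coding: utf-8 -*-

def find_bad_bytes(data, encoding="ascii", context=10):
    """Try to decode data; on failure return (position, nearby bytes).

    Returns None if the data decodes cleanly with the given encoding.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        lo = max(err.start - context, 0)
        return err.start, data[lo:err.end + context]
    return None


# Made-up bytes, NOT the real data from the script.
z = b'\n<th class="mainTableHeader" colspan="2">\xc2\xa0 Tags</th>'

result = find_bad_bytes(z)
if result is not None:
    position, nearby = result
    print("first undecodable byte at position %d" % position)
    print("context: %r" % (nearby,))
```

Printing the context with %r keeps the offending bytes visible as escapes,
so they can be compared against a table of UTF-8 or Latin-1 sequences.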

~K



