Need debugging knowhow for my creeping Unicodephobia

kj no.email at please.post
Wed Feb 10 14:09:46 EST 2010



Some people have mathphobia.  I'm developing a wicked case of
Unicodephobia.

I have read a *ton* of stuff on Unicode.  It doesn't even seem all
that hard.  Or so I think.  Then I start writing code, and WHAM:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

(There, see?  My Unicodephobia just went up a notch.)
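
For the record, the message itself is easy to reproduce in isolation.  The sketch below is not from my script: `raw` is a made-up byte string, and the explicit `decode('ascii')` call is only my assumption about what Python ends up doing behind the scenes.

```python
# Made-up sample, not from the real script: the UTF-8 bytes of a
# no-break space (0xc2 0xa0) followed by plain ASCII.
raw = b'\xc2\xa0hello'

try:
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    # Copy out what we need inside the handler; in Python 3 the
    # exception name is unbound once the except block ends.
    message = str(exc)
    codec = exc.encoding
    offset = exc.start
    bad_bytes = exc.object[exc.start:exc.end]

print(message)
print(codec)
print(offset)
print(repr(bad_bytes))
```

So the exception object at least says which byte, at which offset, and under which codec.  What it does not say is who asked for an ASCII decode in the first place.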

Here's the thing: I don't even know how to *begin* debugging errors
like this.  This is where I could use some help.

In the past I've gone for the method of choice of the clueless:
"programming by trial-and-error", try random crap until something
"works."  And if that "strategy" fails, I come begging for help to
c.l.p.  And thanks for the very effective pointers for getting rid
of the errors.

But afterwards I remain as clueless as ever...  It's the old "give
a man a fish" vs. "teach a man to fish" story.

I need a systematic approach to troubleshooting and debugging these
Unicode errors.  I don't know what.  Some tools maybe.  Some useful
modules or builtin commands.  A diagnostic flowchart?  I don't
think that any more RTFM on Unicode is going to help (I've done it
in spades), but if there's a particularly good write-up on Unicode
debugging, please let me know.

Any suggestions would be much appreciated.

FWIW, I'm using Python 2.6.  The example above happens to come from
a script that extracts data from HTML files, which are all in
English, but errors like it are a daily occurrence when I write code
to process non-English text.  The script uses Beautiful Soup.  I won't
post a lot of code because, as I said, what I'm after is not so
much a way around this specific error as the tools and
techniques to troubleshoot it and fix it on my own.  But to ground
the problem a bit I'll say that the exception above happens during
the execution of a statement of the form:

  x = '%s %s' % (y, z)

Also, I found that, with the exact same values y and z as above,
all of the following statements work perfectly fine:

  x = '%s' % y
  x = '%s' % z
  print y
  print z
  print y, z

TIA!

~K


