Need debugging knowhow for my creeping Unicodephobia

Wed Feb 10 15:05:55 EST 2010

kj wrote:
> 
> Some people have mathphobia.  I'm developing a wicked case of
> Unicodephobia.
> 
> I have read a *ton* of stuff on Unicode.  It doesn't even seem all
> that hard.  Or so I think.  Then I start writing code, and WHAM:
> 
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
> 
> (There, see?  My Unicodephobia just went up a notch.)
> 
> Here's the thing: I don't even know how to *begin* debugging errors
> like this.  This is where I could use some help.
> 
> In the past I've gone for the method of choice of the clueless:
> "programming by trial-and-error", try random crap until something
> "works."  And if that "strategy" fails, I come begging for help to
> c.l.p.  And thanks for the very effective pointers for getting rid
> of the errors.
> 
> But afterwards I remain as clueless as ever...  It's the old "give
> a man a fish" vs. "teach a man to fish" story.
> 
> I need a systematic approach to troubleshooting and debugging these
> Unicode errors.  I don't know what.  Some tools maybe.  Some useful
> modules or builtin commands.  A diagnostic flowchart?  I don't
> think that any more RTFM on Unicode is going to help (I've done it
> in spades), but if there's a particularly good write-up on Unicode
> debugging, please let me know.
> 
> Any suggestions would be much appreciated.
> 
> FWIW, I'm using Python 2.6.  The example above happens to come from
> a script that extracts data from HTML files, which are all in
> English, but such errors are a daily occurrence when I write code to
> process non-English text.  The script uses Beautiful Soup.  I won't
> post a lot of code because, as I said, what I'm after is not so
> much a way around this specific error as much as the tools and
> techniques to troubleshoot it and fix it on my own.  But to ground
> the problem a bit I'll say that the exception above happens during
> the execution of a statement of the form:
> 
>   x = '%s %s' % (y, z)
> 
> Also, I found that, with the exact same values y and z as above,
> all of the following statements work perfectly fine:
> 
>   x = '%s' % y
>   x = '%s' % z
>   print y
>   print z
>   print y, z
> 
Decode all text input; encode all text output; do all text processing
in Unicode, which also means making all text literals Unicode (prefixed
with 'u').
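
As a rough illustration of that rule, here is a minimal sketch. The
sample bytes and the choice of UTF-8 are assumptions made for the
example, not something taken from the original script; in real code the
encoding has to come from wherever the data comes from. It is written
so that it should run unchanged on Python 2.6 and on current Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample data: stands in for bytes read from a file or socket.
raw_name = b'Caf\xc3\xa9'

# 1. Decode on input (UTF-8 is an assumption for this example).
name = raw_name.decode('utf-8')

# 2. Process in Unicode only: note the u prefix on every text literal.
greeting = u'%s %s' % (u'Hello', name)

# 3. Encode on output, just before writing to a file, socket or terminal.
encoded = greeting.encode('utf-8')
```

The point of the layout is that byte strings exist only at the two
edges; everything in between is Unicode, so nothing is left for Python
to convert implicitly.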

Note: I'm talking about when you're working with _text_, as distinct
from when you're working with _binary data_, i.e. bytes.
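
When it is not obvious which of the two a given value is, checking its
type and repr usually settles it. In Python 2.6, putting a byte string
with non-ASCII bytes and a Unicode string into the same format
expression makes Python decode the byte string as ASCII behind your
back, which is most likely what the UnicodeDecodeError in the quoted
message is reporting. The helpers below are a hypothetical sketch
(the names is_text and describe are made up for this example), not
part of any library.

```python
def is_text(value):
    """Return True for Unicode text, False for anything else (e.g. bytes)."""
    # type(u'') is unicode on Python 2 and str on Python 3.
    return isinstance(value, type(u''))


def describe(value):
    """Return the type name and repr of a value, for debugging output."""
    return '%s %r' % (type(value).__name__, value)
```

Printing describe(y) and describe(z) just before the failing statement
should show whether one of them is still a byte string that needs an
explicit decode.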