Unicode chr(150) en dash

Fri Apr 18 07:36:00 EDT 2008

On Fri, 2008-04-18 at 07:27 -0400, J. Clifford Dyer wrote:
> On Fri, 2008-04-18 at 10:28 +0100, marexposed at googlemail.com wrote:
> > On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> > hdante <hdante at gmail.com> wrote:
> > 
> > >  Don't use old 8-bit encodings. Use UTF-8.
> > 
> > Yes, I'll try. But it's a problem when I only want to read the content, not write or create it.
> > I suppose Microsoft's commercial success is to blame. They won't adhere to standards if that doesn't make sense for their business.
> > 
> > I'll change my approach and try filtering the contents with htmllib, mapping those troublesome characters myself.
> > Anyway this has been a very instructive dive into unicode for me, I've got things cleared up now.
> > 
> > Thanks to everyone for the great help.
> > 
> 
> There are a number of byte values (150, or 0x96, being one of them) that
> cp1252 uses for printable characters, but which are reserved for control
> characters in ISO-8859-1 (the range 0x80-0x9F).  Those bytes will pretty
> much never appear in genuine ISO-8859-1 documents.  If you're expecting
> documents of both types coming in, test for the presence of those bytes,
> and assume cp1252 for the documents that contain them.
> 
> Something like:
> 
> for c in control_chars:
>     if c in encoded_text:
>         unicode_text = encoded_text.decode('cp1252')
>         break
> else:
>     unicode_text = encoded_text.decode('latin-1')
> 
> Note that the else matches the for, not the if.
> 
> You can figure out the characters to match on by looking at the
> wikipedia pages for the encodings.

One warning: This works if you know all your documents are in one of
those two encodings, but it will silently mis-decode documents in other
encodings, like UTF-8.  Fortunately, UTF-8 is strict about which byte
sequences are valid, so non-ASCII text in some other 8-bit encoding
almost never decodes cleanly as UTF-8.  You can usually test whether a
document is valid UTF-8 just by wrapping the decode in a try/except
block:

try:
    unicode_text = encoded_text.decode('utf-8')
except UnicodeDecodeError:  # decoding raises UnicodeDecodeError, not UnicodeEncodeError
    # do the stuff above (the cp1252 / latin-1 check)
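
Putting the two checks together, here is a rough sketch of the whole
heuristic.  It assumes Python 3, where the raw input is a bytes object,
and the names guess_decode and C1_BYTES are made up for illustration.
It also falls back to latin-1 when cp1252 itself refuses a byte, since
cp1252 leaves a handful of values in that range (0x81, for example)
unassigned.  Treat it as a starting point, not a real detector:

```python
# Python 3 sketch: `data` is a bytes object read from a file or socket.

# Bytes 0x80-0x9F: control codes in ISO-8859-1, mostly printable in cp1252.
C1_BYTES = frozenset(range(0x80, 0xA0))


def guess_decode(data):
    """Return (text, encoding_name) using the heuristics from this thread."""
    try:
        # Strict UTF-8 first: legacy 8-bit text with accents rarely passes.
        return data.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass
    if any(byte in C1_BYTES for byte in data):
        try:
            return data.decode('cp1252'), 'cp1252'
        except UnicodeDecodeError:
            # cp1252 leaves a few bytes (such as 0x81) unassigned.
            pass
    # latin-1 maps every byte to a code point, so this never fails.
    return data.decode('latin-1'), 'latin-1'
```

For the byte from the subject line, guess_decode(b'1990\x962000') comes
back tagged 'cp1252' with a real en dash (U+2013) in the text.  Returning
the encoding name alongside the text makes it easy to log the guess or
show it to the user for confirmation.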

None of these methods is perfect, but then again, if text encoding
detection were a perfect science, Python could just handle it on its
own.

If in doubt, prompt the user for confirmation.

Maybe others can share better "best practices."

Cheers,
Cliff