Unicode chr(150) en dash

Fri Apr 18 07:27:37 EDT 2008

On Fri, 2008-04-18 at 10:28 +0100, marexposed at googlemail.com wrote:
> On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> hdante <hdante at gmail.com> wrote:
> 
> >  Don't use old 8-bit encodings. Use UTF-8.
> 
> Yes, I'll try. But it's a problem when I only want to read the content, not write or create it.
> I suppose Microsoft's commercial success is to blame. They won't adhere to standards if that doesn't make sense for their business.
> 
> I'll change the approach and try filtering the contents with htmllib, mapping those troublesome characters myself.
> Anyway this has been a very instructive dive into unicode for me, I've got things cleared up now.
> 
> Thanks to everyone for the great help.
> 

There are a number of byte values in the range 128-159 (150 being one of
them) that cp1252 uses for printable characters but that ISO-8859-1
reserves for control characters. Those bytes will pretty much never
appear in genuine ISO-8859-1 documents. If you're expecting documents of
both types coming in, test for the presence of those bytes, and assume
cp1252 for any document that contains one.

Something like:

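# control_chars: the bytes in 0x80-0x9F that cp1252 assigns to printable
# characters and that ISO-8859-1 reserves for control characters
for c in control_chars:
    if c in encoded_text:
        unicode_text = encoded_text.decode('cp1252')
        break
else:
    unicode_text = encoded_text.decode('latin-1')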

Note that the else matches the for, not the if.
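
In case the for/else pairing is unfamiliar, here is a minimal sketch of
how it behaves. The function name first_match is made up for this
illustration; it is not part of any library.

```python
def first_match(needles, haystack):
    """Return the first needle found in haystack, or None if none is found."""
    for needle in needles:
        if needle in haystack:
            result = needle
            break  # leaving the loop via break skips the else clause
    else:
        # Runs only when the loop finished without hitting break,
        # including the case where needles is empty.
        result = None
    return result
```

So the else branch is the "nothing matched" case, which is why the
latin-1 fallback sits there.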

You can figure out the characters to match on by looking at the
Wikipedia pages for the encodings.
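
As an alternative to reading the tables by hand, the bytes can probably
be derived from Python's own cp1252 codec. The following is only a
sketch, written for Python 3 (where iterating over bytes yields
integers); the names cp1252_only_bytes, CP1252_MARKERS and decode_guess
are invented for this example.

```python
def cp1252_only_bytes():
    """Return the byte values in 0x80-0x9F that the cp1252 codec can decode."""
    found = set()
    for value in range(0x80, 0xA0):
        try:
            bytes([value]).decode('cp1252')
        except UnicodeDecodeError:
            # cp1252 leaves a few values in this range (such as 0x81) unassigned
            continue
        found.add(value)
    return found


# Computed once at import time; hypothetical module-level constant.
CP1252_MARKERS = cp1252_only_bytes()


def decode_guess(encoded_text):
    """Decode as cp1252 if any marker byte is present, otherwise as latin-1."""
    if any(b in CP1252_MARKERS for b in encoded_text):
        return encoded_text.decode('cp1252')
    return encoded_text.decode('latin-1')
```

For example, decode_guess(b'a \x96 b') should give back the text with a
proper en dash (U+2013). Note that this is still a heuristic: a document
that contains both a marker byte and one of the unassigned bytes will
raise UnicodeDecodeError in the cp1252 branch, so you may want to pass
errors='replace' there.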

Cheers,
Cliff