Unicode chr(150) en dash

J. Clifford Dyer jcd at sdf.lonestar.org
Fri Apr 18 07:27:37 EDT 2008


On Fri, 2008-04-18 at 10:28 +0100, marexposed at googlemail.com wrote:
> On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> hdante <hdante at gmail.com> wrote:
> 
> >  Don't use old 8-bit encodings. Use UTF-8.
> 
> Yes, I'll try. But it is a problem when I only want to read, not when I'm trying to write or create the content.
> To blame, I suppose, is Microsoft's commercial success. They won't adhere to standards if that doesn't make sense for their business.
> 
> I'll change the approach trying to filter the contents with htmllib and mapping on my own those troubling characters.
> Anyway this has been a very instructive dive into unicode for me, I've got things cleared up now.
> 
> Thanks to everyone for the great help.
> 

There are a number of byte values (150 being one of them) that map to
printable characters in cp1252 but are reserved for control characters
in ISO-8859-1. They all fall in the range 128-159 (0x80-0x9F). Those
control characters will pretty much never be used in ISO-8859-1
documents. If you're expecting documents of both types coming in, test
for the presence of those bytes, and assume cp1252 for any document that
contains them.

Something like:

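# Byte values 0x80-0x9F: control characters in ISO-8859-1, mostly
# printable characters in cp1252 (Python 2 byte strings).
control_chars = [chr(i) for i in range(0x80, 0xA0)]

for c in control_chars:
    if c in encoded_text:
        unicode_text = encoded_text.decode('cp1252')
        break
else:
    unicode_text = encoded_text.decode('latin-1')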

Note that the else matches the for, not the if: the latin-1 branch runs
only when the loop finishes without hitting the break.
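
For anyone on Python 3, where the encoded text is a bytes object, the
same check might look roughly like the sketch below. The name
decode_legacy is made up for illustration, and using errors="replace"
is my own assumption about how to handle the five byte values that
cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D); without it,
Python's cp1252 codec raises UnicodeDecodeError on them.

```python
# Byte values 0x80-0x9F: C1 control characters in ISO-8859-1,
# mostly printable characters in cp1252.
C1_RANGE = frozenset(range(0x80, 0xA0))


def decode_legacy(encoded_text):
    """Guess between cp1252 and latin-1 for a bytes object.

    Hypothetical helper, not part of any library.
    """
    # Iterating over bytes in Python 3 yields integers.
    if any(b in C1_RANGE for b in encoded_text):
        # Assumption: replace the bytes cp1252 leaves undefined with
        # U+FFFD instead of raising UnicodeDecodeError.
        return encoded_text.decode("cp1252", errors="replace")
    return encoded_text.decode("latin-1")
```

With this, decode_legacy(b"a \x96 b") gives the en dash (U+2013) in
place of byte 150, while text without any byte in that range goes
through latin-1 unchanged.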

You can figure out the exact characters to match on by looking at the
Wikipedia pages for the two encodings.
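
If you'd rather ask Python than read the tables, a small sketch along
these lines lists every byte value where the two codecs disagree. It is
written for Python 3, and the function name differing_bytes is just
something I made up for the example.

```python
def differing_bytes():
    """Map each byte value that cp1252 and latin-1 decode differently
    to its cp1252 character, or to None if cp1252 leaves it undefined."""
    diffs = {}
    for i in range(256):
        b = bytes([i])
        latin = b.decode("latin-1")  # latin-1 defines all 256 values
        try:
            cp = b.decode("cp1252")
        except UnicodeDecodeError:
            cp = None  # no character assigned in Python's cp1252 codec
        if cp != latin:
            diffs[i] = cp
    return diffs


if __name__ == "__main__":
    for value, char in sorted(differing_bytes().items()):
        print(value, hex(value), repr(char))
```

Every key it returns sits between 128 and 159, and the entry for 150 is
the en dash.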

Cheers,
Cliff




