Encoding sniffer?

Stuart Bishop stuart at stuartbishop.net
Wed Jan 11 03:26:08 EST 2006


skip at pobox.com wrote:
>     Andreas> Does anyone know of a Python module that is able to sniff the
>     Andreas> encoding of text?
> 
> I have such a beast.  Search here:
> 
>     http://orca.mojam.com/~skip/python/
> 
> for "decode".
> 
> Skip

We have similar code. It looks functionally the same except that we also:

	Check whether the string starts with a BOM.
	Detect probable ISO-8859-15 using a set of characters that are
	common in ISO-8859-15 but uncommon in ISO-8859-1.
	Have doctests :-)
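
To make the second point concrete, the byte values that mean different things in the two encodings can be listed directly. This is only an illustrative sketch, assuming Python 3 and its bundled codecs, and is not part of the code described in this message; the name `differing` is made up for the example.

```python
# Collect every byte value that decodes to a different character
# under ISO-8859-1 than under ISO-8859-15.
differing = [
    b for b in range(256)
    if bytes([b]).decode('iso-8859-1') != bytes([b]).decode('iso-8859-15')
]

# Print them as two-digit hex values.
print(' '.join('%02x' % b for b in differing))
# → a4 a6 a8 b4 b8 bc bd be
```

These eight positions hold symbols such as the euro sign in ISO-8859-15, where ISO-8859-1 has less common characters such as the generic currency sign and the vulgar fractions.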

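    # Detect BOM. The UTF-32 BOMs must be tested first, because
    # BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE.
    _boms = [
        (codecs.BOM_UTF32_BE, 'utf_32_be'),
        (codecs.BOM_UTF32_LE, 'utf_32_le'),
        (codecs.BOM_UTF16_BE, 'utf_16_be'),
        (codecs.BOM_UTF16_LE, 'utf_16_le'),
        ]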

    try:
        for bom, encoding in _boms:
            if s.startswith(bom):
                return unicode(s[len(bom):], encoding)
    except UnicodeDecodeError:
        pass

    [...]

    # If any of these bytes appear, the text is probably ISO-8859-15
    if re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", s) is not None:
        try:
            return unicode(s, 'ISO-8859-15')
        except UnicodeDecodeError:
            pass
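
For readers who want something runnable, the two checks above might be combined roughly as follows. This is a sketch under stated assumptions, not the code referred to in this message: it targets Python 3 (bytes in, str out), the function name `guess_decode` is hypothetical, and the UTF-8 BOM entry, the UTF-8 attempt, and the final ISO-8859-1 fallback are guesses standing in for the elided part.

```python
import codecs
import re

# Longer BOMs first: BOM_UTF32_LE starts with the same bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, 'utf_32_be'),
    (codecs.BOM_UTF32_LE, 'utf_32_le'),
    (codecs.BOM_UTF8, 'utf_8'),          # assumption: not in the original list
    (codecs.BOM_UTF16_BE, 'utf_16_be'),
    (codecs.BOM_UTF16_LE, 'utf_16_le'),
]

# Byte values whose meaning differs between ISO-8859-1 and ISO-8859-15.
_LATIN9_HINT = re.compile(rb'[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]')


def guess_decode(data):
    r"""Hypothetical helper: decode bytes using a best-guess encoding.

    >>> guess_decode(codecs.BOM_UTF16_LE + 'hi'.encode('utf_16_le'))
    'hi'
    >>> guess_decode(b'10 \xa4') == '10 \u20ac'
    True
    """
    # 1. A BOM, if present, names the encoding outright.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            try:
                return data[len(bom):].decode(encoding)
            except UnicodeDecodeError:
                break  # misleading BOM; fall through to the heuristics

    # 2. Assumption: try UTF-8 next, since it rarely validates by accident.
    try:
        return data.decode('utf_8')
    except UnicodeDecodeError:
        pass

    # 3. Bytes typical of ISO-8859-15 suggest that encoding.
    if _LATIN9_HINT.search(data):
        return data.decode('iso-8859-15')

    # 4. Assumption: fall back to ISO-8859-1, which accepts any byte.
    return data.decode('iso-8859-1')
```

The docstring examples can be run with `python -m doctest` on the file that holds the function. The BOM ordering matters: with the UTF-16 entries first, UTF-32 LE input would match the two-byte UTF-16 LE BOM and decode to text full of NUL characters instead of raising an error.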

Feel free to update your available code. Otherwise, I can probably post ours
somewhere if necessary.

-- 
Stuart Bishop <stuart at stuartbishop.net>
http://www.stuartbishop.net/