Encoding sniffer?
Stuart Bishop
stuart at stuartbishop.net
Wed Jan 11 03:26:08 EST 2006
skip at pobox.com wrote:
> Andreas> Does anyone know of a Python module that is able to sniff the
> Andreas> encoding of text?
>
> I have such a beast. Search here:
>
> http://orca.mojam.com/~skip/python/
>
> for "decode".
>
> Skip
We have similar code. It looks functionally the same, except that ours also:

- Checks whether the string starts with a BOM.
- Detects probable ISO-8859-15 using a set of characters that are common
  in ISO-8859-15 but uncommon in ISO-8859-1.
- Has doctests :-)
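    # Detect BOM. The UTF-32 entries come first because the UTF-32 LE
    # BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
    _boms = [
        (codecs.BOM_UTF32_BE, 'utf_32_be'),
        (codecs.BOM_UTF32_LE, 'utf_32_le'),
        (codecs.BOM_UTF16_BE, 'utf_16_be'),
        (codecs.BOM_UTF16_LE, 'utf_16_le'),
        ]
    try:
        for bom, encoding in _boms:
            if s.startswith(bom):
                # Strip the BOM, then decode the rest with its encoding
                return unicode(s[len(bom):], encoding)
    except UnicodeDecodeError:
        pass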
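
For readers on Python 3, the BOM check can be sketched roughly as follows.
This is only an illustration, not the code referred to in this message:
the function name decode_by_bom is hypothetical, the UTF-8 BOM entry is an
addition, and falling through to the next candidate on a decode error is
an assumption about the desired behaviour.

```python
import codecs

# Longer BOMs are listed before the shorter BOMs they begin with:
# the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, 'utf_32_be'),
    (codecs.BOM_UTF32_LE, 'utf_32_le'),
    (codecs.BOM_UTF8, 'utf_8'),        # not in the original list
    (codecs.BOM_UTF16_BE, 'utf_16_be'),
    (codecs.BOM_UTF16_LE, 'utf_16_le'),
]


def decode_by_bom(data):
    """Return the decoded text if data starts with a known BOM, else None.

    Hypothetical helper: data is a bytes object.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            try:
                # Drop the BOM itself and decode the remainder.
                return data[len(bom):].decode(encoding)
            except UnicodeDecodeError:
                # The BOM matched but the payload did not decode;
                # try the remaining candidates.
                continue
    return None
```

A caller would try this first and fall back to other heuristics when it
returns None.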
[...]
    # If we have characters in this range, it is probably ISO-8859-15
    if re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", s) is not None:
        try:
            return unicode(s, 'ISO-8859-15')
        except UnicodeDecodeError:
            pass
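
The eight bytes in that character class (0xA4, 0xA6, 0xA8, 0xB4, 0xB8 and
0xBC-0xBE) are exactly the positions where ISO-8859-15 differs from
ISO-8859-1, for example 0xA4 is the euro sign rather than the generic
currency sign. A rough Python 3 sketch of the same heuristic is below. The
name decode_latin is hypothetical, and the ISO-8859-1 fallback is an
assumption, since the fallback in the original code is not shown here.

```python
import re

# Bytes whose meaning differs between ISO-8859-1 and ISO-8859-15.
_LATIN9_HINT = re.compile(rb"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]")


def decode_latin(data):
    """Guess between ISO-8859-15 and ISO-8859-1 for a bytes object.

    Hypothetical helper; the ISO-8859-1 fallback is an assumption.
    """
    if _LATIN9_HINT.search(data) is not None:
        # One of the eight differing bytes is present: probably Latin-9.
        return data.decode('iso8859_15')
    return data.decode('iso8859_1')
```

This is only a guess: a genuine ISO-8859-1 document that uses, say, the
fraction characters at 0xBC-0xBE would be misread as ISO-8859-15.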
Feel free to fold these checks into the code you have published. Otherwise,
I can probably post ours somewhere if necessary.
--
Stuart Bishop <stuart at stuartbishop.net>
http://www.stuartbishop.net/