Detecting Unicode encodings

Jason Diamond jason at injektilo.org
Sat Aug 21 13:57:34 EDT 2004


Hi.

Is it possible to decode a UTF-8 (with or without a BOM), UTF-16 (BE or
LE with a BOM), or UTF-32 (BE or LE with a BOM) byte stream without
knowing what encoding the stream is in?

I know how to use the codecs module to get StreamReader classes that can
decode a specific encoding, but I have to know what that encoding is
beforehand.
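
For reference, the known-encoding case looks roughly like this. It is a
minimal sketch assuming Python 3; the sample bytes and the names raw,
reader and text are made up for illustration.

```python
import codecs
import io

# Hypothetical sample data: a UTF-16 LE byte stream that starts with a BOM.
raw = io.BytesIO(codecs.BOM_UTF16_LE + "hello".encode("utf-16-le"))

# codecs.getreader() returns the StreamReader class for a named encoding.
# The plain "utf-16" codec reads the BOM and picks the byte order from it,
# but the caller still has to know the stream is UTF-16 in the first place.
reader = codecs.getreader("utf-16")(raw)
text = reader.read()
```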

If I read up to four bytes from the byte stream, I can figure out what
encoding the stream is in but that has problems for UTF-8 streams
without BOMs--I would have just eaten one or more bytes that might need
to be decoded by the StreamReader. I could seek back to the beginning of
the stream but what if the file-like object I was reading from didn't
support seeking?
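
One possible way around the seeking problem is to keep the bytes that were
read while sniffing and feed them to the decoder first. The sketch below
assumes Python 3 and uses codecs.getincrementaldecoder; the helper names
sniff and decode_stream are hypothetical, and treating a stream with no BOM
as UTF-8 is an assumption taken from the cases listed above.

```python
import codecs
import io

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff(stream):
    """Return (encoding, leftover); leftover is whatever was read past the BOM."""
    # Assumption: read(4) returns four bytes unless the stream is shorter.
    head = stream.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, head[len(bom):]
    # Assumption: no BOM means UTF-8, so all sniffed bytes are real data.
    return "utf-8", head


def decode_stream(stream, chunk_size=4096):
    """Yield decoded text from a byte stream without ever calling seek()."""
    encoding, leftover = sniff(stream)
    decoder = codecs.getincrementaldecoder(encoding)()
    # Push the already-consumed bytes through the decoder before reading more.
    text = decoder.decode(leftover)
    if text:
        yield text
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    text = decoder.decode(b"", final=True)
    if text:
        yield text
```

For example, "".join(decode_stream(f)) would decode any file-like object f
opened in binary mode. An incremental decoder buffers partial characters
between calls, so it does not matter if the sniffed bytes end in the middle
of a multi-byte sequence.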

Thanks.

-- Jason


