Processing text data with different encodings

Tue Jun 28 03:46:09 EDT 2016

On Tue, Jun 28, 2016 at 5:25 PM, Michael Welle <mwe012008 at gmx.net> wrote:
> I want to use Python 3 to process data, that unfortunately come with
> different encodings. So far I have found ascii, iso-8859, utf-8,
> windows-1252 and maybe some more in the same file (don't ask...). I read
> the data via sys.stdin and the idea is to read a line, detect the
> current encoding, hit it until it looks like utf-8 and then go on with
> the next line of input:
>
>
> import cchardet
>
> for line in sys.stdin.buffer:
>
>     encoding = cchardet.detect(line)['encoding']
>     line = line.decode(encoding, 'ignore')\
>                .encode('UTF-8').decode('UTF-8', 'ignore')
>
>
> After that line should be a string. The logging module and some others
> choke on line: UnicodeEncodeError: 'charmap' codec can't encode
> character. What would be a right approach to tackle that problem
> (assuming that I can't change the input data)?

This is the exact sort of "ewwww" that I have to cope with in my MUD
client. Sometimes it gets sent UTF-8, other times it gets sent...
uhhhh... some eight-bit encoding, most likely either 8859 or 1252 (but
could theoretically be anything). The way I cope with it is to do a
line-by-line decode, similar to what you're doing, but with a much
simpler algorithm - something like this:

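def decode_lines(binary_source):
    # binary_source: any iterable of bytes lines, e.g. sys.stdin.buffer
    for line in binary_source:
        try:
            line = line.decode("UTF-8")
        except UnicodeDecodeError:
            # Not valid UTF-8, so assume an eight-bit encoding
            line = line.decode("1252")
        yield line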

There's no need to chardet for UTF-8; if you successfully decode the
text, it's almost certainly correct. (This includes pure ASCII text,
which would also decode successfully and correctly as ISO-8859 or
Windows-1252.)
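
As a rough illustration of why no detection step is needed, here is a
small self-contained sketch. The helper name decode_line and the sample
bytes are made up for the example. The errors="replace" on the fallback
is an extra assumption, added because cp1252 leaves a few byte values
(0x81, for instance) undefined and a plain decode would raise on them.

```python
def decode_line(raw):
    """Decode one line of bytes: UTF-8 first, Windows-1252 as fallback."""
    try:
        return raw.decode("UTF-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as 1252.
        # errors="replace" covers the few bytes cp1252 leaves undefined.
        return raw.decode("cp1252", errors="replace")


# Pure ASCII decodes to the same text under all three encodings.
ascii_bytes = b"plain text\n"
assert (ascii_bytes.decode("UTF-8")
        == ascii_bytes.decode("ISO-8859-1")
        == ascii_bytes.decode("cp1252"))

# The same text encoded two different ways.
utf8_bytes = "caf\u00e9\n".encode("UTF-8")     # b"caf\xc3\xa9\n"
cp1252_bytes = "caf\u00e9\n".encode("cp1252")  # b"caf\xe9\n"

# The 1252 bytes are not valid UTF-8, so they take the fallback path,
# and both inputs end up as the same str.
assert decode_line(utf8_bytes) == decode_line(cp1252_bytes) == "caf\u00e9\n"
```

The fallback can still guess wrong if the non-UTF-8 lines are in some
other eight-bit encoding; the sketch only shows that valid UTF-8 and
ASCII never need a detector.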

You shouldn't need this complicated decode-encode-decode dance. Just decode
it once and work with text from there on. Ideally, you should be using
Python 3, where "work[ing] with text" is exactly how most of the code
wants to work; if not, resign yourself to reprs with u-prefixes, and
work with Unicode strings anyway. It'll save you a lot of trouble.
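
A minimal sketch of "decode once, then work with text" might look like
the following. The names decode_line and process are hypothetical, the
upper() call is only a placeholder for real processing, and the BytesIO
objects stand in for sys.stdin.buffer and sys.stdout.buffer. Writing
UTF-8 explicitly on the way out is an assumption on my part: the
'charmap' UnicodeEncodeError quoted above is raised when text is being
encoded, which suggests (from the message alone) that some output
stream is using a non-UTF-8 codec.

```python
import io


def decode_line(raw):
    """Decode one line of bytes: UTF-8 first, Windows-1252 as fallback."""
    try:
        return raw.decode("UTF-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def process(binary_in, binary_out):
    """Decode each input line once, work on str, write UTF-8 bytes."""
    for raw in binary_in:
        text = decode_line(raw)   # bytes -> str, exactly once
        text = text.upper()       # placeholder for the real text processing
        binary_out.write(text.encode("UTF-8"))  # str -> bytes at the boundary


# One UTF-8 line followed by one Windows-1252 line, as in a mixed file.
source = io.BytesIO("na\u00efve\n".encode("UTF-8")
                    + "caf\u00e9\n".encode("cp1252"))
sink = io.BytesIO()
process(source, sink)
```

After this runs, sink holds consistent UTF-8 regardless of which
encoding each input line arrived in.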

ChrisA