Processing text data with different encodings

Peter Otten __peter__ at web.de
Tue Jun 28 04:30:20 EDT 2016


Michael Welle wrote:

> Hello,
> 
> I want to use Python 3 to process data that unfortunately comes with
> different encodings. So far I have found ascii, iso-8859, utf-8,
> windows-1252 and maybe some more in the same file (don't ask...). I read
> the data via sys.stdin and the idea is to read a line, detect the
> current encoding, hit it until it looks like utf-8 and then go on with
> the next line of input:
> 
> 
> import cchardet
> 
> for line in sys.stdin.buffer:
> 
>     encoding = cchardet.detect(line)['encoding']
>     line = line.decode(encoding, 'ignore')\
>                .encode('UTF-8').decode('UTF-8', 'ignore')

Here the last decode('UTF-8', 'ignore') undoes the preceding 
encode('UTF-8'); therefore

      line = line.decode(encoding, 'ignore')

should suffice. Does chardet ever return an encoding that fails to decode 
the line? Only in that case the "ignore" error handler would make sense. I 
expect that

for line in sys.stdin.buffer:
    encoding = cchardet.detect(line)['encoding']
    line = line.decode(encoding)

will work if you don't want to use the alternative suggested by Chris.
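If the set of encodings in the file is known, an alternative to charset 
detection is a fixed-priority fallback: try the strictest codec first and 
fall through to more permissive ones. A minimal sketch (the candidate list 
here is an assumption based on the encodings you mention):

```python
import sys

# Candidate encodings tried in order; utf-8 comes first so that
# valid utf-8 lines are never misread by a more permissive codec.
CANDIDATES = ["utf-8", "windows-1252", "iso-8859-1"]

def decode_line(raw):
    """Decode a bytes line with the first candidate codec that succeeds."""
    for encoding in CANDIDATES:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # iso-8859-1 accepts every byte value, so this is only reached
    # if the candidate list is changed; keep a safe fallback anyway.
    return raw.decode("utf-8", "replace")

if __name__ == "__main__":
    for line in sys.stdin.buffer:
        print(decode_line(line), end="")
```

Unlike per-line detection this cannot be fooled by short lines, but it 
silently misreads lines in an encoding that is not on the list.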

> After that, line should be a string. The logging module and some others
> choke on line: UnicodeEncodeError: 'charmap' codec can't encode
> character. What would be a right approach to tackle that problem
> (assuming that I can't change the input data)?

It looks like you are trying to write the Unicode string you have generated 
above into a file opened with iso-8859-1 or a similar encoding:

$ cat log_unicode.py
import logging
LOGGER = logging.getLogger()
LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="ISO-8859-1"))
LOGGER.critical("\N{PILE OF POO}")
$ python3 log_unicode.py 
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.4/logging/__init__.py", line 980, in emit
    stream.write(msg)
UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f4a9' in 
position 0: ordinal not in range(256)
Call stack:
  File "log_unicode.py", line 5, in <module>
    LOGGER.critical("\N{PILE OF POO}")
Message: '💩'
Arguments: ()

If my assumption is correct you can either change the target file's encoding 
to UTF-8 or change the error handling strategy to "xmlcharrefreplace", 
"ignore", or something else. I didn't find an official way to set the error 
handler, so here's a minimal example:

$ rm tmp.txt
$ cat log_unicode.py
import logging

class FileHandler(logging.FileHandler):
    def _open(self):
        return open(
            self.baseFilename, self.mode, encoding=self.encoding,
            errors="xmlcharrefreplace")

LOGGER = logging.getLogger()
LOGGER.addHandler(FileHandler("tmp.txt", encoding="ISO-8859-1"))
LOGGER.critical("\N{PILE OF POO}")
$ python3 log_unicode.py 
$ cat tmp.txt
💩

A real program would of course override the initializer...
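Such an initializer override might look roughly like this (a sketch; the 
errors parameter and its xmlcharrefreplace default are my additions, 
mirroring the errors argument of open()):

```python
import logging

class FileHandler(logging.FileHandler):
    """FileHandler that passes an error handler through to open()."""

    def __init__(self, filename, mode="a", encoding=None, delay=False,
                 errors="xmlcharrefreplace"):
        # Stored under a private name so it is already available when
        # the base initializer opens the stream via _open().
        self._errors = errors
        super().__init__(filename, mode=mode, encoding=encoding, delay=delay)

    def _open(self):
        return open(self.baseFilename, self.mode,
                    encoding=self.encoding, errors=self._errors)
```

With that in place the handler can be configured per instance, e.g. 
FileHandler("tmp.txt", encoding="ISO-8859-1", errors="replace"), instead of 
hard-coding the strategy in _open().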



