Processing text data with different encodings

Tue Jun 28 11:11:31 EDT 2016

On Tue, 28 Jun 2016 10:30 pm, Michael Welle wrote:

> I changed the code from my initial mail to:
> 
> LOGGER = logging.getLogger()
> LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="utf-8"))
> 
> for l in sys.stdin.buffer:
>     l = l.decode('utf-8')
>     LOGGER.critical(l)

I imagine you're running this over input known to contain only UTF-8 text?
Because if you run it over your emails with non-UTF-8 content, the decode
call will raise UnicodeDecodeError.
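
To make that concrete, here is a minimal sketch of the failure. The sample
line is made up for illustration (it is not from your mail), and the names
raw and failed are just placeholders:

```python
# A hypothetical line that was encoded as Latin-1 rather than UTF-8.
raw = 'abüd\n'.encode('latin-1')   # b'ab\xfcd\n'

try:
    raw.decode('utf-8')            # 'strict' is the default error handler
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print('cannot decode:', err)
```

The byte 0xFC can never start a valid UTF-8 sequence, so the strict decoder
gives up on that line.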

I would try this:

import sys

for l in sys.stdin.buffer:
    l = l.decode('utf-8', errors='surrogateescape')
    print(repr(l))  # or log it, whichever you prefer

Simulating that at the interactive prompt, with one Latin-1 line mixed in
among the UTF-8 ones, gives this output:

py> buffer = []
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer.append('abüd\n'.encode('latin-1'))
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer
[b'ab\xc3\xbcd\n', b'ab\xc3\xbcd\n', b'ab\xfcd\n', b'ab\xc3\xbcd\n']
py> for l in buffer:  #sys.stdin.buffer:
...     l = l.decode('utf-8', errors='surrogateescape')
...     print(repr(l))
...
'abüd\n'
'abüd\n'
'ab\udcfcd\n'
'abüd\n'

See the second-last line? The \udcfc code point is a lone surrogate, standing
in for the "bad byte" \xfc. See the docs for the surrogateescape error
handler (PEP 383) for further details.
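
One consequence worth knowing: the escaped byte is not lost. As a minimal
sketch (the variable names are mine, and the input is the same made-up
Latin-1 line as above), encoding with the same error handler gives back the
original bytes, while a strict encode refuses the surrogate:

```python
raw = b'ab\xfcd\n'  # Latin-1 bytes, not valid UTF-8

# The undecodable byte 0xFC is smuggled through as the surrogate U+DCFC.
text = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode('utf-8', errors='surrogateescape')

# A plain strict encode cannot represent a lone surrogate.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

So if you later write such strings somewhere that encodes strictly as UTF-8,
expect an encoding error unless that output also uses surrogateescape (or
another forgiving handler such as backslashreplace).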

Alternatively, you could try:

import sys

for l in sys.stdin.buffer:
    try:
        l = l.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        l = l.decode('latin1')  # May generate mojibake.
    print(repr(l))  # or log it, whichever you prefer

This version should give satisfactory results if the email actually does
contain lines of Latin-1 mixed in with the UTF-8 (use 'cp1252' as the
fallback codec if you prefer Windows-1252). If not, it will generate
mojibake, which may or may not be acceptable to your users.
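
The same idea can be wrapped in a small helper so that the fallback encoding
is easy to change. This is only a sketch: decode_line and its fallback
parameter are names I have made up, and the sample lines are invented.

```python
def decode_line(raw, fallback='latin-1'):
    """Decode bytes as UTF-8, falling back to a single-byte encoding."""
    try:
        return raw.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail,
        # but the result is mojibake if the guess is wrong.
        return raw.decode(fallback)

lines = [
    'abüd\n'.encode('utf-8'),
    'abüd\n'.encode('latin-1'),
    '€5\n'.encode('cp1252'),   # b'\x805\n', which is not Latin-1 text
]
decoded = [decode_line(raw) for raw in lines]
for line in decoded:
    print(repr(line))
```

The third line shows the mojibake case: with the Latin-1 fallback the euro
sign comes out as the control character '\x80', whereas
decode_line(raw, fallback='cp1252') recovers '€5\n'. Note that Python's
cp1252 codec leaves a handful of bytes undefined, so unlike Latin-1 that
fallback can itself raise UnicodeDecodeError.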

-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.