read from file with mixed encodings in Python3

Mon Nov 7 09:42:47 EST 2011

Jaroslav Dobrek wrote:

> Hello,
> 
> in Python3, I often have this problem: I want to do something with
> every line of a file. Like Python3, I presuppose that every line is
> encoded in utf-8. If this isn't the case, I would like Python3 to do
> something specific (like skipping the line, writing the line to
> standard error, ...)
> 
> Like so:
> 
> try:
>    ....
> except UnicodeDecodeError:
>   ...
> 
> Yet, there is no place for this construction. If I simply do:
> 
> for line in f:
>     print(line)
> 
> this will result in a UnicodeDecodeError if some line is not utf-8,
> but I can't tell Python3 to stop:
> 
> This will not work:
> 
> for line in f:
>     try:
>         print(line)
>     except UnicodeDecodeError:
>         ...
> 
> because the UnicodeDecodeError is caused in the "for line in f"-part.
> 
> How can I catch such exceptions?
> 
> Note that recoding the file before opening it is not an option,
> because often files contain many different strings in many different
> encodings.

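I don't see those files often, but I think they are all seriously broken.
There's no reliable way to recover the information from files with unknown
mixed encodings. However, here's an approach that may sometimes work: open
the file in binary mode and decode each line yourself, so the decoding
happens inside the loop body where you can catch the exception. Note that
the Latin-1 fallback never fails, because every byte sequence is valid
Latin-1, so a line in some other encoding is silently mis-decoded: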

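>>> with open("tmp.txt", "rb") as f:
...     # Binary mode: each line is a bytes object, nothing is decoded yet.
...     for line in f:
...         try:
...             line = "UTF-8 " + line.decode("utf-8")
...         except UnicodeDecodeError:
...             # Not valid UTF-8; Latin-1 accepts any byte sequence.
...             line = "Latin-1 " + line.decode("latin-1")
...         print(line, end="")
...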
UTF-8 äöü
Latin-1 äöü
UTF-8 äöü
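
If you would rather skip the bad lines and write them to standard error,
as you describe, the same idea can be wrapped in a generator. This is only
a sketch: the name decode_lines, its parameters and the message format are
made up for illustration, not something the standard library provides.

```python
import sys


def decode_lines(binary_lines, encoding="utf-8", report=sys.stderr):
    """Yield each line decoded as `encoding`; report and skip the rest.

    `binary_lines` is any iterable of bytes objects, for example a file
    opened with mode "rb". (Hypothetical helper, not a stdlib function.)
    """
    for lineno, raw in enumerate(binary_lines, start=1):
        try:
            yield raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # The decoding happens here, inside the try block, so the
            # exception can be caught. The message format is arbitrary.
            print(f"line {lineno}: skipped ({exc.reason}): {raw!r}",
                  file=report)


# Usage, with the file opened in binary mode as in the session above:
#
#     with open("tmp.txt", "rb") as f:
#         for line in decode_lines(f):
#             print(line, end="")
```

If you only need to get through the file without an exception and don't
care which lines were bad, open() also takes an errors argument:
open("tmp.txt", encoding="utf-8", errors="replace") substitutes U+FFFD for
every undecodable byte instead of raising.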