read from file with mixed encodings in Python3

Mon Nov 7 09:33:33 EST 2011

On 11/07/2011 09:23 AM, Jaroslav Dobrek wrote:
> Hello,
>
> in Python3, I often have this problem: I want to do something with
> every line of a file. Like Python3, I presuppose that every line is
> encoded in utf-8. If this isn't the case, I would like Python3 to do
> something specific (like skipping the line, writing the line to
> standard error, ...)
>
> Like so:
>
> try:
>     ...
> except UnicodeDecodeError:
>     ...
>
> Yet, there is no place for this construction. If I simply do:
>
> for line in f:
>     print(line)
>
> this will result in a UnicodeDecodeError if some line is not utf-8,
> but I can't tell Python3 to stop:
>
> This will not work:
>
> for line in f:
>     try:
>         print(line)
>     except UnicodeDecodeError:
>         ...
>
> because the UnicodeDecodeError is caused in the "for line in f"-part.
>
> How can I catch such exceptions?
>
> Note that recoding the file before opening it is not an option,
> because often files contain many different strings in many different
> encodings.
>
> Jaroslav

A file with mixed encodings isn't a text file.  So open it in 'rb'
mode and use read() on it.  Find the line endings yourself, since a
given '\n' byte may or may not be a line ending (in UTF-16, for
example, that byte can be one half of some other character).
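
A minimal sketch of that first step, assuming every encoding in the
file ends a line with a single b'\n' byte (true for UTF-8 and Latin-1,
not for UTF-16).  split_raw_lines is a hypothetical helper name, not
something from the standard library:

```python
def split_raw_lines(data):
    """Split raw bytes into lines without decoding anything.

    Assumption: b'\\n' always marks a line ending in this data.
    """
    lines = data.split(b"\n")
    # bytes.split leaves a trailing empty item when the data ends
    # with b'\n' (or is empty); drop it so it isn't counted as a line.
    if lines and lines[-1] == b"":
        lines.pop()
    # Tolerate Windows-style endings by stripping a trailing b'\r'.
    return [line.rstrip(b"\r") for line in lines]
```

The data would come from something like open(path, 'rb').read(), where
path is whatever file you are processing.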

Once you've got something that looks like a line, explicitly decode it
using utf-8.  Some lines that aren't really utf-8 will raise
UnicodeDecodeError, but others will happen to decode without an
exception and silently give you the wrong text.  But perhaps you've got
some other gimmick to tell the encoding for each line.
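
Something along these lines might do for the decoding step.  It is only
a sketch: decode_lines is a made-up name, the lines are assumed to be
bytes objects already, and bad lines are reported on standard error
(one of the options the original post mentions):

```python
import sys


def decode_lines(raw_lines, encoding="utf-8"):
    """Yield (line_number, text) for each line that decodes cleanly.

    Lines that raise UnicodeDecodeError are reported on stderr and
    skipped; adapt the except branch to whatever handling you need.
    """
    for number, raw in enumerate(raw_lines, start=1):
        try:
            yield number, raw.decode(encoding)
        except UnicodeDecodeError as exc:
            print("line %d: not valid %s (%s)" % (number, encoding, exc.reason),
                  file=sys.stderr)


# Second item is Latin-1 encoded, so it is not valid utf-8.
sample = [b"caf\xc3\xa9", b"caf\xe9", b"plain"]
good = list(decode_lines(sample))
```

Here good contains only lines 1 and 3.  Note that this catches only the
lines that fail to decode; a line in another encoding that happens to
be valid utf-8 passes through unnoticed.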

-- 

DaveA