read a file and remove Mojibake chars

Peter Otten __peter__ at web.de
Thu Apr 7 05:49:33 EDT 2016


Daiyue Weng wrote:

> Hi, when I read a file, the file string contains Mojibake chars at the
> beginning, the code is like,
> 
> file_str = open(file_path, 'r', encoding='utf-8').read()
> print(repr(open(file_path, 'r', encoding='utf-8').read()))
> 
> part of the string (as printed) containing Mojibake chars looks like,
> 
>   '锘縶\n "name": "__NAME__"'
> 
> I tried to remove the non utf-8 chars using the code,
> 
> def read_config_file(fname):
>     with open(fname, "r", encoding='utf-8') as fp:
>         for line in fp:
>             line = line.strip()
>             line = line.decode('utf-8','ignore').encode("utf-8")
> 
>     return fp.read()
> 
> but it doesn't work, so how to remove the Mojibakes in this case?

I'd first investigate whether the file can be decoded correctly using an
encoding other than UTF-8, but if it's really hopeless and your best bet is
to ignore all non-ASCII characters, try

def read_config_file(fname):
    with open(fname, "r", encoding="ascii", errors="ignore") as f:
        return f.read()
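
Note that errors="ignore" silently drops every non-ASCII character, so it is
a last resort. One thing worth checking first: a leading '锘' followed by
another odd character is what a UTF-8 byte order mark (bytes EF BB BF) tends
to look like when it gets decoded as GBK somewhere along the way. If that is
what is happening here (an assumption, I have not seen the file), the
"utf-8-sig" codec may be a less lossy fix, since it decodes UTF-8 and skips a
leading BOM. A minimal sketch; the function names below are made up for
illustration:

```python
import codecs


def starts_with_utf8_bom(fname):
    """Return True if the file's first three bytes are the UTF-8 BOM."""
    # Read raw bytes so that no decoding happens before the check.
    with open(fname, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def read_config_file_skip_bom(fname):
    """Read the file as UTF-8, dropping a leading BOM if there is one."""
    # "utf-8-sig" behaves like "utf-8" but strips a BOM at the start;
    # files without a BOM are read unchanged.
    with open(fname, "r", encoding="utf-8-sig") as f:
        return f.read()
```

If starts_with_utf8_bom() returns False, the BOM theory is wrong and it is
back to trying other encodings.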



