read a file and remove Mojibake chars

Thu Apr 7 08:51:45 EDT 2016

On Thu, Apr 7, 2016 at 6:47 PM, Daiyue Weng <daiyueweng at gmail.com> wrote:
> Hi, when I read a file, the resulting string contains Mojibake chars
> at the beginning. The code is like,
>
> file_str = open(file_path, 'r', encoding='utf-8').read()
> print(repr(open(file_path, 'r', encoding='utf-8').read()))
>
> Part of the printed string, containing the Mojibake chars, looks like,
>
>   '锘縶\n "name": "__NAME__"'
>
> I tried to remove the non-UTF-8 chars using this code,
>
> def read_config_file(fname):
>     with open(fname, "r", encoding='utf-8') as fp:
>         for line in fp:
>             line = line.strip()
>             line = line.decode('utf-8','ignore').encode("utf-8")
>
>     return fp.read()
>
> but it doesn't work, so how do I remove the Mojibake in this case?

This won't work as it currently stands. You're looping over the file,
stripping, *DE*coding (which shouldn't work, since a Python 3 str has
no decode() method - although in Python 2, it sorta-kinda might),
re-encoding, and then dropping the lines on the floor. Then, after
you've closed the file, you try to read from it. So yeah, it doesn't
work.
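
For comparison, here's a minimal sketch of what I'm guessing the
function was meant to do. It assumes the goal is to return the file's
text with each line stripped of surrounding whitespace; that's my
assumption, not something your post states. It keeps the lines instead
of discarding them, skips the decode/encode round trip, and builds the
result while the file is still open.

```python
def read_config_file(fname):
    """Return the file's text with every line stripped (assumed intent)."""
    lines = []
    # The text layer already decodes UTF-8, so each line is a str.
    with open(fname, "r", encoding="utf-8") as fp:
        for line in fp:
            # Keep the stripped line rather than throwing it away.
            lines.append(line.strip())
    # Build the result from what was collected; no read after close.
    return "\n".join(lines)
```

Note that this only tidies up whitespace. It does nothing about the
odd characters at the start of your string.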

But if you're able to read the file *at all* using your original code,
it must be a correctly-formed UTF-8 stream. The probability that
random non-ASCII bytes just happen to be UTF-8 decodable is
vanishingly low, so I suspect your data issue has nothing to do with
encodings.
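
If you want to see how unlikely that is, here's a rough sketch that
decodes random runs of non-ASCII bytes and reports what fraction get
through. The function name, trial count, run length and seed are all
arbitrary choices of mine, purely for illustration.

```python
import random

def utf8_decodable_fraction(trials=10_000, length=32, seed=0):
    """Fraction of random non-ASCII byte strings that decode as UTF-8."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        # Every byte is in 0x80..0xFF, i.e. none of them is ASCII.
        data = bytes(rng.randrange(0x80, 0x100) for _ in range(length))
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            continue
        ok += 1
    return ok / trials

print(utf8_decodable_fraction())
```

Strict UTF-8 needs every lead byte to be followed by exactly the right
number of continuation bytes, which random data almost never manages.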

ChrisA