read a file and remove Mojibake chars

Thu Apr 7 10:19:58 EDT 2016

On Thu, Apr 7, 2016, at 04:47, Daiyue Weng wrote:
> Hi, when I read a file, the file string contains Mojibake chars at the
> beginning, the code is like,
> 
> file_str = open(file_path, 'r', encoding='utf-8').read()
> print(repr(open(file_path, 'r', encoding='utf-8').read())
> 
> part of the string (been printing) containing Mojibake chars is like,
> 
>   '锘縶\n "name": "__NAME__"'

Based on a hunch, I tried something:

"锘縶" happens to be the GBK/GB18030 interpretation of the bytes "ef bb bf
7b", which is a UTF-8 byte order mark followed by "{".

So what happened is that someone wrote text in UTF-8 with a byte-order
marker, and someone else read this as GBK/GB18030 and wrote the
resulting characters as UTF-8. So it may be easier to simply
special-case it:

if file_str[:2] == '锘縶': file_str = '{' + file_str[2:]
elif file_str[:2] == '锘縖': file_str = '[' + file_str[2:]

In principle, the whole process could be reversed as file_str =
file_str.encode('gbk').decode('utf-8'), but that would be overkill if it
contains no other ASCII characters and can't contain anything at the
start except these. Plus, if there are any other non-ASCII characters in
the string, it's anyone's guess as to whether they survived the process
in a way that allows you to reverse it.