[Tutor] UnicodeDecodeError while parsing a .csv file.
Albert-Jan Roskam
fomcl at yahoo.com
Tue Oct 29 11:33:03 CET 2013
-------------------------------------------
On Tue, 10/29/13, eryksun <eryksun at gmail.com> wrote:
Subject: Re: [Tutor] UnicodeDecodeError while parsing a .csv file.
To: "Steven D'Aprano" <steve at pearwood.info>
Cc: tutor at python.org
Date: Tuesday, October 29, 2013, 3:24 AM
On Mon, Oct 28, 2013 at 7:49 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>
> By default Python 3 uses UTF-8 when reading files. As the error below
> shows, your file actually isn't UTF-8.

Modules default to UTF-8, but io.TextIOWrapper defaults to the locale preferred encoding. To handle terminals, it first tries os.device_encoding (i.e. _Py_device_encoding). Otherwise, for files it defaults to locale.getpreferredencoding(False).
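To make that concrete, here is a small stdlib-only sketch: a TextIOWrapper created without an explicit encoding falls back to the locale's preferred encoding (the Python 3 behaviour at the time; UTF-8 only became the default much later, via PEP 686).

```python
import io
import locale

# A TextIOWrapper with no encoding argument asks the locale module,
# not UTF-8, for its encoding.
buf = io.BytesIO()
wrapper = io.TextIOWrapper(buf)  # encoding=None -> locale default

print(wrapper.encoding)                    # e.g. 'utf-8' or 'cp1252'
print(locale.getpreferredencoding(False))  # the same source of truth
```

On a machine whose locale is not UTF-8 (e.g. cp1252 on Western European Windows), this is exactly why reading a UTF-8 file without an explicit `encoding=` can raise UnicodeDecodeError, or vice versa.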
==> Why is do_setlocale=False here? Actually, what does this parameter do? It seems strange that a getter function has a 'set' argument.
>>> import locale
>>> help(locale.getpreferredencoding)
Help on function getpreferredencoding in module locale:
getpreferredencoding(do_setlocale=True)
Return the charset that the user is likely using.
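As far as I understand it (worth checking against the locale docs), the parameter answers the question above: with do_setlocale=True the function may temporarily call locale.setlocale(locale.LC_CTYPE, "") to read the user's environment before querying the charset, then restore the old locale; setlocale is process-global and not thread-safe, which is why library code such as io.TextIOWrapper passes False and just asks about the *current* locale without touching anything. A quick sketch:

```python
import locale

# do_setlocale=True (the default): may temporarily switch to the
# user's environment locale to answer the question.
print(locale.getpreferredencoding(True))

# do_setlocale=False: report the charset of the locale as it is
# right now, with no side effects -- safe from library code.
print(locale.getpreferredencoding(False))
```

So it is less a "setter" argument than a "may I briefly call setlocale while answering?" flag.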
Other remark: I have not read this entire thread, but I was thinking the OP might use codecs.open to open the file in the correct encoding. If that encoding is unknown, maybe chardet could be used to guess it: https://pypi.python.org/pypi/chardet. I have never used this module, but it seems worth a try.
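A minimal stdlib-only sketch of both ideas (on Python 3, the built-in open(path, encoding=...) does what codecs.open was used for; chardet does real statistical detection, and the `read_with_guess` helper below is only a crude, hypothetical stand-in for it):

```python
def read_with_guess(data: bytes, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in turn and return (text, encoding).

    latin-1 maps every byte to a character, so it never raises and
    works as a last resort -- though it may produce the wrong text.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

# Valid UTF-8 decodes on the first try:
print(read_with_guess("café".encode("utf-8")))   # ('café', 'utf-8')
# A lone 0xE9 byte is invalid UTF-8, so we fall back to latin-1:
print(read_with_guess(b"\xe9"))                  # ('é', 'latin-1')
```

If the guess is wrong you get mojibake rather than an exception, which is why a statistical detector like chardet is the better tool when the encoding really is unknown.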
The other day I received a file that had been encoded multiple times, so the accented characters were all messed up. I had to reverse engineer it, and it turned out that a sequence of latin-1 and utf-8 had been used. It would be nice if (1) this didn't happen in the first place ;-) and (2) some library helped with this "de-mojibake" process.
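For the specific latin-1/utf-8 case described above, the repair is a round trip that reverses the wrong decode step (the third-party ftfy library automates this kind of fix for messier cases):

```python
# UTF-8 bytes wrongly decoded as Latin-1 produce classic mojibake;
# re-encoding as Latin-1 recovers the original bytes, which then
# decode correctly as UTF-8.
good = "café"
mojibake = good.encode("utf-8").decode("latin-1")      # 'cafÃ©'
repaired = mojibake.encode("latin-1").decode("utf-8")  # 'café'
print(mojibake, "->", repaired)
```

This only works because Latin-1 is lossless over all 256 byte values; if the intermediate encoding had dropped or replaced bytes, the original text would be unrecoverable.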