[Tutor] UnicodeDecodeError while parsing a .csv file.

Albert-Jan Roskam fomcl at yahoo.com
Tue Oct 29 11:33:03 CET 2013


-------------------------------------------
On Tue, 10/29/13, eryksun <eryksun at gmail.com> wrote:

 Subject: Re: [Tutor] UnicodeDecodeError while parsing a .csv file.
 To: "Steven D'Aprano" <steve at pearwood.info>
 Cc: tutor at python.org
 Date: Tuesday, October 29, 2013, 3:24 AM
 
 On Mon, Oct 28, 2013 at 7:49 PM, Steven D'Aprano <steve at pearwood.info> wrote:
 >
 > By default Python 3 uses UTF-8 when reading files. As the error below
 > shows, your file actually isn't UTF-8.
 
 Modules default to UTF-8, but io.TextIOWrapper defaults to the locale
 preferred encoding. To handle terminals, it first tries
 os.device_encoding (i.e. _Py_device_encoding). Otherwise, for files it
 defaults to locale.getpreferredencoding(False).
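
==> Side note: you can see that default in action with something like the following (untested sketch; any text file will do in place of 'data.csv'):

import locale

f = open('data.csv')                     # no encoding given: the locale encoding is used
print(f.encoding, locale.getpreferredencoding(False))
f.close()

f = open('data.csv', encoding='utf-8')   # an explicit encoding overrides the locale default
print(f.encoding)
f.close()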
 
 
==> Why is do_setlocale=False here? Actually, what does this parameter do? It seems strange that a getter function has a 'set' argument.

>>> import locale
>>> help(locale.getpreferredencoding)
Help on function getpreferredencoding in module locale:

getpreferredencoding(do_setlocale=True)
    Return the charset that the user is likely using.
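
==> My rough understanding (corrections welcome): with do_setlocale=True the function temporarily calls setlocale(LC_CTYPE, "") so that the answer reflects the user's environment, and restores the old locale afterwards; with do_setlocale=False it only looks at the locale that is already in effect and never touches the global locale state. Something along these lines:

import locale

# Default form: may briefly switch to the user's locale to query the charset.
print(locale.getpreferredencoding())

# do_setlocale=False: only query the current locale.  Python passes False
# internally (e.g. when picking a default encoding for open()) so that
# merely asking for the encoding never changes the locale as a side effect.
print(locale.getpreferredencoding(False))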

Other remark: I have not read this entire thread, but I was thinking the OP might use codecs.open to open the file with the correct encoding. If that encoding is unknown, maybe chardet could be used to guess it: https://pypi.python.org/pypi/chardet. I have never used this module, but it seems worth a try, roughly along the lines of the sketch below.
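
Untested sketch (I have only glanced at chardet's docs, so treat the exact API use as an assumption; 'data.csv' stands in for the OP's file):

import codecs
import chardet

with open('data.csv', 'rb') as f:          # read the raw bytes first
    guess = chardet.detect(f.read())       # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.8}

with codecs.open('data.csv', encoding=guess['encoding']) as f:
    for line in f:
        print(line.rstrip())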

The other day I received a file that had been encoded multiple times, so the accented characters were all messed up. I had to reverse-engineer it, and it turned out that a mix of latin-1 and utf-8 steps had been applied. It would be nice if (1) this didn't happen in the first place ;-) and (2) some library helped with this "de-mojibake" process. A toy reconstruction of what I ended up doing is below.
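
Toy reconstruction (Python 3; my own guess at the sequence, not the actual file): text that was UTF-8 encoded, wrongly decoded as latin-1, and encoded as UTF-8 again. Reversing the steps in the opposite order recovers it:

good = 'café'
mangled = good.encode('utf-8').decode('latin-1').encode('utf-8')
print(mangled)        # b'caf\xc3\x83\xc2\xa9' -- the classic mojibake pattern
repaired = mangled.decode('utf-8').encode('latin-1').decode('utf-8')
print(repaired)       # 'café' again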
