[Tutor] Urgent: unicode problems writing CSV file
Steven D'Aprano
steve at pearwood.info
Wed Jun 8 13:43:29 EDT 2016
On Wed, Jun 08, 2016 at 09:54:23AM -0400, Alex Hall wrote:
> All,
> I'm working on a project that writes CSV files, and I have to get it done
> very soon. I've done this before, but I'm suddenly hitting a problem with
> unicode conversions. I'm trying to write data, but getting the standard
> cannot encode character: ordinal not in range(128)
I infer from your error that you are using Python 2. Is that right? You
should say so, *especially* for Unicode problems, because Python 3 uses
a very different (and much better) system for handling text strings.
Also, there is no such thing as a "standard" error. All error messages
are different, and they usually show lots of debugging information that
you haven't yet learned to read. But we have, so please show us the full
traceback!
> I've tried
> str(info).encode("utf8")
> str(info).decode("utf8")
One of the problems with Python 2 is that it allows two nonsense
operations: str.encode and unicode.decode. The whole string handling
thing in Python 2 is a bit of a mess. It's over 20 years old, and dates
back to before Unicode even existed, so you'll have to excuse a bit of
confusion. In Python 2:
(1) str means *byte string*, NOT text string, and is limited
to "chars" with ordinal values 0 to 255;
(2) unicode means "text string";
(3) In an attempt to be helpful, Python 2 will try to automatically
convert to and from byte strings as needed. This works so long
as all your characters are ASCII, but leads to chaos, confusion
and error as soon as you have non-ASCII characters involved.
Python 3 fixes these confusing features.
Remember two facts:
(1) To go from TEXT to BYTES (i.e. unicode -> str) use ENCODE;
(2) To go from BYTES to TEXT (i.e. str -> unicode) use DECODE.
but you must be careful to prevent Python doing those automatic
conversions first.
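As a minimal sketch of those two directions (the variable names are illustrative and not taken from the original question; the snippet is written so that it should run unchanged on Python 2.7 and Python 3):

```python
# Illustrative names; u'' and b'' literals work on Python 2.7 and 3.3+.
text = u'abc\u00b5'                 # TEXT: four characters, the last is the micro sign

data = text.encode('utf-8')         # TEXT -> BYTES: encode
round_trip = data.decode('utf-8')   # BYTES -> TEXT: decode

print(repr(data))
print(round_trip == text)
```

Note that the text has four characters but the byte string has five bytes, because UTF-8 needs two bytes for the micro sign.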
Looking at your code:
    str(info).encode("utf8")
that's wrong, because it calls encode on a str, which is already bytes;
encode is for going from unicode to str. But using decode also gives the
same error. That hints that the error is happening in the call to str()
first.
Firstly, we need to know what info is. Run this:
    print type(info)
    print repr(info)
    print str(info)
and report any errors and output. I'm going to assume that info is a
unicode object. Why? Because that will give the error you experience:
    py> info = u'abcµ'
    py> str(info)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 3: ordinal not in range(128)
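The same failure can be reproduced without str() by asking for the ASCII codec explicitly. This is only a sketch, and it assumes a default Python 2 setup where str() on a unicode object falls back to the ASCII codec; the names `failed` and `message` are made up for the illustration:

```python
info = u'abc\u00b5'

try:
    info.encode('ascii')            # ASCII has no byte for the micro sign
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)
else:
    failed = False
    message = ''

print(message)
```

If the message ends with "ordinal not in range(128)", an ASCII encode is being attempted somewhere, whether you asked for it or not.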
The right way to convert unicode text to a byte str is with the encode
method. Unless you have good reason to use another encoding, always use
UTF-8 (which I see you are doing, great).
    py> info.encode('utf-8')
    'abc\xc2\xb5'
If all your Unicode text strings are valid and correct, that should be
all you need, but if you are paranoid and fear "invalid" Unicode
strings, which can theoretically happen (ask me how if you care), you
can take a belt-and-braces approach and preemptively deal with errors by
converting them to question marks.
NOTE THAT THIS THROWS AWAY INFORMATION FROM YOUR UNICODE TEXT.
If your paranoia exceeds your fear of losing information, you can
instruct Python to use a ? any time there is an encoding error:
    info.encode('utf-8', errors='replace')
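To see what the replacement costs, here is a small sketch. It uses the ASCII codec for the lossy case purely as an assumption for the demonstration, because UTF-8 can encode ordinary Unicode text and so rarely triggers the error handler at all:

```python
info = u'abc\u00b5'

# With a codec that cannot represent the micro sign, it becomes '?'.
lossy = info.encode('ascii', errors='replace')

# With UTF-8 there is nothing to replace for this string.
safe = info.encode('utf-8', errors='replace')

print(repr(lossy))
print(repr(safe))
```

Once a character has been turned into a question mark, decoding cannot bring it back.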
So to recap:
- you have a variable `info`, which I am guessing is unicode
- you can convert it to a byte str with:
    info.encode('utf-8')
  or for the paranoid:
    info.encode('utf-8', errors='replace')
Now that you have a byte string, you can just write it out to the CSV
file.
To read it back in, you read the CSV file, which returns a byte str, and
then convert back to Unicode with:
    info = data.decode('utf-8')
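Put together, the write-then-read round trip might look roughly like the sketch below. The helper names `write_rows` and `read_rows`, the sample rows, and the temporary file are all assumptions made up for the illustration. The Python 2 branch encodes each cell before handing it to csv, as described above; the Python 3 branch is included only so the same file runs on both versions.

```python
import csv
import io
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2


def write_rows(path, rows):
    """Write rows of text strings to a UTF-8 encoded CSV file."""
    if PY2:
        # Python 2's csv module wants byte strings, so encode each cell.
        with open(path, 'wb') as f:
            writer = csv.writer(f)
            for row in rows:
                writer.writerow([cell.encode('utf-8') for cell in row])
    else:
        # Python 3's csv module wants text; the file object does the encoding.
        with io.open(path, 'w', encoding='utf-8', newline='') as f:
            csv.writer(f).writerows(rows)


def read_rows(path):
    """Read a UTF-8 encoded CSV file back into rows of text strings."""
    if PY2:
        # csv returns byte strings on Python 2, so decode each cell.
        with open(path, 'rb') as f:
            return [[cell.decode('utf-8') for cell in row]
                    for row in csv.reader(f)]
    with io.open(path, 'r', encoding='utf-8', newline='') as f:
        return list(csv.reader(f))


# Hypothetical sample data: every cell is a text (unicode) string.
rows = [[u'name', u'unit'], [u'resistor', u'abc\u00b5']]

fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
try:
    write_rows(path, rows)
    loaded = read_rows(path)
finally:
    os.remove(path)

print(loaded == rows)
```

The point of the design is that the rest of the program only ever sees text; bytes exist only at the moment of writing to or reading from the file.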
> unicode(info, "utf8")
When you run this, what exception do you get? My guess is that you get
the following TypeError:
    py> unicode(u'abc', 'utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: decoding Unicode is not supported
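If you genuinely don't know whether a value is bytes or text, one way to avoid decoding something that is already Unicode is a small guard like the one sketched here. The helper name `to_text` is hypothetical, not something from the standard library:

```python
def to_text(value, encoding='utf-8'):
    """Return value as a text string, decoding only if it is bytes."""
    if isinstance(value, bytes):      # bytes is an alias for str on Python 2
        return value.decode(encoding)
    return value                      # anything else (e.g. text) is returned unchanged


print(to_text(b'abc\xc2\xb5') == u'abc\u00b5')
print(to_text(u'abc') == u'abc')
```

This is a workaround rather than a cure: it is still better to know, at each point in the program, which of the two types you are holding.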
> csvFile = open("myFile.csv", "wb", encoding="utf-8") #invalid keyword argument
Python 3's open() lets you set the encoding of a file; Python 2's
built-in open() doesn't. In Python 2 you can use io.open instead, but
note that this won't help you here, as (1) the Python 2 csv module
doesn't support Unicode, and (2) your problem lies elsewhere.
P.S. don't feel bad if the whole Unicode thing is confusing you. Most
people go through a period of confusion, because you have to unlearn
nearly everything you thought you knew about text in computers before
you can really get Unicode.
--
Steve