[Tutor] Urgent: unicode problems writing CSV file

Wed Jun 8 13:43:29 EDT 2016

On Wed, Jun 08, 2016 at 09:54:23AM -0400, Alex Hall wrote:
> All,
> I'm working on a project that writes CSV files, and I have to get it done
> very soon. I've done this before, but I'm suddenly hitting a problem with
> unicode conversions. I'm trying to write data, but getting the standard
> cannot encode character: ordinal not in range(128)

I infer from your error that you are using Python 2. Is that right? You 
should say so, *especially* for Unicode problems, because Python 3 uses 
a very different (and much better) system for handling text strings.

Also, there is no such thing as a "standard" error. All error messages 
are different, and they usually show lots of debugging information that 
you haven't yet learned to read. But we have, so please show us the full 
traceback!

> I've tried
> str(info).encode("utf8")
> str(info).decode("utf8")

One of the problems with Python 2 is that it allows two nonsense 
operations: str.encode and unicode.decode. The whole string handling 
thing in Python 2 is a bit of a mess. It's over 20 years old, and dates 
back to before Unicode even existed, so you'll have to excuse a bit of 
confusion. In Python 2:

(1) str means *byte string*, NOT text string, and is limited 
    to "chars" with ordinal values 0 to 255;

(2) unicode means "text string";

(3) In an attempt to be helpful, Python 2 will try to automatically
    convert to and from byte strings as needed. This works so long
    as all your characters are ASCII, but leads to chaos, confusion
    and error as soon as you have non-ASCII characters involved.

Python 3 fixes these confusing features.

Remember two facts:

(1) To go from TEXT to BYTES (i.e. unicode -> str) use ENCODE;

(2) To go from BYTES to TEXT (i.e. str -> unicode) use DECODE.

but you must be careful to prevent Python doing those automatic 
conversions first.
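
As a minimal sketch of those two facts (the variable names are mine, and 
the snippet is written so that it should behave the same on Python 2.7 
and Python 3), the round trip looks like this:

```python
# Illustrative names only; `text`, `data` and `back` are not from the original code.
text = u'abc\xb5'             # TEXT: a unicode string ending in MICRO SIGN

data = text.encode('utf-8')   # TEXT -> BYTES: encode
assert data == b'abc\xc2\xb5'

back = data.decode('utf-8')   # BYTES -> TEXT: decode
assert back == text
```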

Looking at your code:

    str(info).encode("utf8")

that's wrong, because it calls encode on a str, which is already bytes; 
encode is for going from unicode to str. But using decode also gives the 
same error. That hints that the error is happening in the call to str() 
first.

Firstly, we need to know what info is. Run this:

    print type(info)
    print repr(info)
    print str(info)

and report any errors and output. I'm going to assume that info is a 
unicode object. Why? Because that will give the error you experience:

py> info = u'abcµ'
py> str(info)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 3: ordinal not in range(128)
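
In Python 2, str() on a unicode object implicitly encodes it with the 
default encoding, which is normally ASCII. Here is a rough sketch of that 
hidden step, spelled out explicitly so that it fails the same way on 
Python 2.7 and Python 3 (the variable `message` is just for 
illustration):

```python
info = u'abc\xb5'

try:
    # Roughly what Python 2's str(info) attempts behind the scenes.
    info.encode('ascii')
except UnicodeEncodeError as exc:
    message = str(exc)
else:
    message = None

print(message)
```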

The right way to convert unicode text to a byte str is with the encode 
method. Unless you have good reason to use another encoding, always use 
UTF-8 (which I see you are doing, great).

py> info.encode('utf-8')
'abc\xc2\xb5'

If all your Unicode text strings are valid and correct, that should be 
all you need, but if you are paranoid and fear "invalid" Unicode 
strings, which can theoretically happen (ask me how if you care), you 
can take a belt-and-braces approach and preemptively deal with errors by 
converting them to question marks.

NOTE THAT THIS THROWS AWAY INFORMATION FROM YOUR UNICODE TEXT.

If your paranoia exceeds your fear of losing information, you can 
instruct Python to use a ? any time there is an encoding error:

    info.encode('utf-8', errors='replace')
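
To make the information loss concrete, here is a small sketch. UTF-8 can 
encode every valid character, so replacement rarely triggers with it; I 
am using the ASCII codec purely as a stand-in that is guaranteed to fail 
on the µ.

```python
info = u'abc\xb5'

# ASCII cannot represent the micro sign, so 'replace' substitutes '?'.
lossy = info.encode('ascii', 'replace')
assert lossy == b'abc?'

# The original character cannot be recovered from the replaced bytes.
assert lossy.decode('ascii') != info
```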

So to recap:

- you have a variable `info`, which I am guessing is unicode

- you can convert it to a byte str with:

    info.encode('utf-8')

  or for the paranoid:

    info.encode('utf-8', errors='replace')

Now that you have a byte string, you can just write it out to the CSV 
file.

To read it back in, you read the CSV file, which returns a byte str, and 
then convert back to Unicode with:

    info = data.decode('utf-8')
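
Putting the write and read steps together, here is a hedged sketch. The 
helper names `encode_row` and `decode_row` are hypothetical, not part of 
the csv module, and I am assuming each row is a list of cells in which 
some are unicode text and others (numbers, say) are not.

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3


def encode_row(row, encoding='utf-8'):
    """Encode each text cell to bytes, leaving other cells untouched."""
    return [cell.encode(encoding) if isinstance(cell, text_type) else cell
            for cell in row]


def decode_row(row, encoding='utf-8'):
    """Decode each byte-string cell back to text, leaving other cells untouched."""
    return [cell.decode(encoding) if isinstance(cell, bytes) else cell
            for cell in row]


row = [u'abc\xb5', 42]
encoded = encode_row(row)
assert encoded == [b'abc\xc2\xb5', 42]
assert decode_row(encoded) == row
```

In Python 2 you would pass encode_row(row) to csv.writer(f).writerow() 
on a file opened in "wb" mode, and run decode_row() over each row that 
csv.reader(f) hands back.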

> unicode(info, "utf8")

When you run this, what exception do you get? My guess is that you get 
the following TypeError:

py> unicode(u'abc', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported
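
If you can't be sure whether a value is already unicode, one option is 
to decode only when it is a byte string. A minimal sketch; `to_text` is 
a hypothetical helper name, not a built-in:

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding only if it is a byte string."""
    if isinstance(value, bytes):   # bytes is an alias for str on Python 2
        return value.decode(encoding)
    return value                   # already text (or not a string): leave alone


assert to_text(b'abc\xc2\xb5') == u'abc\xb5'
assert to_text(u'abc\xb5') == u'abc\xb5'
```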

> csvFile = open("myFile.csv", "wb", encoding="utf-8") #invalid keyword
> argument

Python 3's open() lets you set the encoding of a file; Python 2's 
built-in open() doesn't. In Python 2 you can use io.open() instead, but 
note that this won't help you here, as (1) Python 2's csv module doesn't 
support Unicode, and (2) your problem lies elsewhere.
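
For completeness, here is a sketch of what the io module gives you for 
plain (non-CSV) text files. io.open() accepts an encoding argument on 
both Python 2.7 and Python 3; the temporary file is only for 
illustration.

```python
import io
import os
import tempfile

text = u'abc\xb5'

# Illustrative scratch file; any writable path would do.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # io.open encodes on write and decodes on read for you.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with io.open(path, 'r', encoding='utf-8') as f:
        read_back = f.read()
    # Reading in binary mode shows the UTF-8 bytes actually on disk.
    with io.open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)

assert read_back == text
assert raw == b'abc\xc2\xb5'
```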

P.S. don't feel bad if the whole Unicode thing is confusing you. Most 
people go through a period of confusion, because you have to unlearn 
nearly everything you thought you knew about text in computers before 
you can really get Unicode.

-- 
Steve