[Tutor] UnicodeDecodeError while parsing a .csv file.

Tue Oct 29 00:31:54 CET 2013

On 10/28/2013 6:13 PM, SM wrote:
 > Hello,
Hi, welcome to the Tutor list.

 > I have an extremely simple piece of code

which could be even simpler - see my comments below

 > which reads a .csv file, which has 1000 lines of fixed fields, one line at a time, and tries to print some values.
 >
 >   1 #!/usr/bin/python3
 >   2 #
 >   3 import sys, time, re, os
 >   4
 >   5 if __name__=="__main__":
 >   6
 >   7     ifd = open("infile.csv", 'r')

The simplest way to discard the first line is to follow the open with
8     ifd.readline()

The simplest way to track line number is

10     for linenum, line in enumerate(ifd, 1):

 >  11         line1 = line.split(",")

FWIW you don't need re to do this split

 >  12         total = 0
 >  19         print("LINE: ", linenum, line1[1])
 >  20         for i in range(1,8):
 >  21             if line1[i].strip():
 >  22                 print("line[i] ", int(line1[i]))
 >  23                 total = total + int(line1[i])
 >  24         print("Total: ", total)
 >  25
 >  26         if total >= 4:
 >  27             print("POSITIVE")
 >  28         else:
 >  29             print("Negative")
 >  31     ifd.close()

That should have () after it, since it is a method call.
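
Putting those suggestions together, a minimal sketch of the loop might look like this. The function name score_lines and the sample data are made up for illustration; the column range 1..7 and the threshold of 4 come from your script.

```python
import io

def score_lines(ifd):
    """Yield (linenum, total, verdict) for each data line of an open text file."""
    ifd.readline()  # discard the header line
    for linenum, line in enumerate(ifd, 1):
        fields = line.split(",")
        # Sum fields 1..7, skipping empty ones, as the original loop does.
        total = sum(int(f) for f in fields[1:8] if f.strip())
        yield linenum, total, "POSITIVE" if total >= 4 else "Negative"

# Hypothetical sample standing in for infile.csv
sample = io.StringIO("id,a,b,c,d,e,f,g\nx,1,1,1,1,0,0,0\ny,1,,0,0,0,0,0\n")
results = list(score_lines(sample))
for linenum, total, verdict in results:
    print("LINE:", linenum, "Total:", total, verdict)
```

With a real file you would pass the object returned by open() instead of the StringIO, ideally inside a with block so the file is closed for you.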
 >
 > It works fine till it parses the 1st 126 lines in the input file. For the 127th line (irrespective of the contents of the actual line), it prints the following error:
 > Traceback (most recent call last):
 >   File "p1.py", line 10, in <module>
 >     for line in ifd:
 >   File "/usr/lib/python3.2/codecs.py", line 300, in decode
 >     (result, consumed) = self._buffer_decode(data, self.errors, final)
 > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173: invalid continuation byte

Do you get exactly the same message irrespective of the contents of the
actual line?

"Code points larger than 127 are represented by multi-byte sequences, 
composed of a leading byte and one or more continuation bytes. The 
leading byte has two or more high-order 1s followed by a 0, while 
continuation bytes all have '10' in the high-order position."
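
To make the bit patterns concrete, here is a small sketch that classifies a byte by its high-order bits, following the rule quoted above. The helper name utf8_byte_kind is my own, and it deliberately ignores the fact that a few byte values (0xC0, 0xC1 and 0xF5 to 0xFF) never occur in valid UTF-8.

```python
def utf8_byte_kind(b):
    """Classify one byte value (0..255) by its high-order bits."""
    if b & 0b10000000 == 0:
        return "ascii"            # 0xxxxxxx
    if b & 0b11000000 == 0b10000000:
        return "continuation"     # 10xxxxxx
    return "leading"              # 11xxxxxx (two or more high-order 1s)

# The two bytes that encode "é" in UTF-8, then the byte from the traceback.
for b in "é".encode("utf-8"):
    print(hex(b), format(b, "08b"), utf8_byte_kind(b))
print(hex(0xE9), format(0xE9, "08b"), utf8_byte_kind(0xE9))
```

Note that 0xe9 on its own has the bit pattern of a leading byte, not a continuation byte.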

This suggests that the 0xe9 at position 1173 was taken as a leading byte 
(its high-order bits are 1110), so continuation bytes were expected after 
it, but the byte that follows is not one. 0xe9 is "é" in Latin-1, so the 
file may well not be UTF-8 at all.
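
A small experiment shows how Python reports this kind of failure. The sample bytes are invented, and the guess that the file is Latin-1 is only an assumption until we see the data.

```python
# Hypothetical line containing "é" stored as the single Latin-1 byte 0xe9.
raw = "café,1\n".encode("latin-1")   # b'caf\xe9,1\n'

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # exc.start is the offset of the offending byte within raw.
    print(exc.reason, "at offset", exc.start, hex(raw[exc.start]))

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")
print(text)
```

If the file does turn out to be Latin-1, then open("infile.csv", encoding="latin-1") would be the thing to try.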

BTW, when I divide 1173 by 126 I get something close to 9 characters per 
line. That is not possible, as there would have to be at least 16 
characters in each line.

Best you send us at least the first 130 lines so we can play with the file.
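
In the meantime, one way to find the offending lines yourself is to read the file in binary mode and try to decode each line separately. This is only a sketch: the function name find_undecodable_lines and the sample bytes are hypothetical, and the line numbers here count the header as line 1.

```python
def find_undecodable_lines(raw, encoding="utf-8"):
    """Return (linenum, column, line_bytes) for each line that fails to decode."""
    bad = []
    for linenum, line in enumerate(raw.splitlines(), 1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the byte offset of the problem within this line.
            bad.append((linenum, exc.start, line))
    return bad

# Hypothetical data; for the real file use: raw = open("infile.csv", "rb").read()
raw = b"id,name\n1,abc\n2,caf\xe9\n3,xyz\n"
for linenum, column, line in find_undecodable_lines(raw):
    print(linenum, column, line)
```

Printing the raw bytes of the reported lines shows exactly which characters are not valid UTF-8.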

-- 
Bob Gailer
919-636-4239
Chapel Hill NC