pandas read_csv

Fri Nov 9 12:15:37 EST 2018

Sharan Basappa wrote:

> are there any requirements about the format of the CSV file when using
> read_csv from pandas? For example, is it necessary that the csv file has
> to have same number of columns in every line etc.

> ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 3

The error message is quite clear: look for extra fields in line 8 of your 
data ;)
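
If the file is too long to eyeball, a quick scan with the standard csv 
module can point at the offending lines. This is only a rough sketch: 
find_ragged_rows is a name I made up, it assumes comma-separated data, and 
it takes the width of the first non-empty row as the expected field count.

```python
import csv
import io

def find_ragged_rows(text, expected=None):
    """Return (line_number, field_count) for rows of unexpected width.

    If expected is None, the width of the first non-empty row is used.
    Line numbers are 1-based and come from csv.reader.line_num.
    """
    reader = csv.reader(io.StringIO(text))
    ragged = []
    for row in reader:
        if not row:  # csv.reader yields [] for blank lines; skip them
            continue
        if expected is None:
            expected = len(row)
        if len(row) != expected:
            ragged.append((reader.line_num, len(row)))
    return ragged

sample = "foo,bar\n1,2\n3,4,5\n6,7\n"
print(find_ragged_rows(sample))  # [(3, 3)]
```

For a real file you would pass open(path, newline="").read() instead of 
the sample string.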

Now let's make a few experiments:

>>> import pandas, io
>>> def dump(s):
...     return pandas.read_csv(io.StringIO(s))
... 
>>> dump("""foo,bar
... 1,2
... """
... )
   foo  bar
0    1    2

[1 rows x 2 columns]
>>> dump("""foo,bar
... 1,2,3
... 4,5
... """)
   foo  bar
1    2    3
4    5  NaN

[2 rows x 2 columns]
>>> dump("""foo,bar
... 1,2
... 3,4,5
... """)
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "<stdin>", line 2, in dump
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 420, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 225, in _read
    return parser.read()
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 626, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1070, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas/parser.c:6937)
  File "parser.pyx", line 749, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7156)
  File "parser.pyx", line 802, in pandas.parser.TextReader._read_rows (pandas/parser.c:7757)
  File "parser.pyx", line 789, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7640)
  File "parser.pyx", line 1697, in pandas.parser.raise_parser_error (pandas/parser.c:19092)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

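From this I infer that no row in the csv file may contain more columns than 
the first data row. Short rows are padded with NaN automatically. Note that 
in the second experiment the first data row has one field more than the 
header, so pandas uses the first column as the index.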

There is also an option to suppress rows containing too many columns:

>>> pandas.read_csv(io.StringIO("foo,bar\n1,2\n3,4,5\n6,7"),
...                 error_bad_lines=False)
b'Skipping line 3: expected 2 fields, saw 3\n'
   foo  bar
0    1    2
1    6    7

[2 rows x 2 columns]
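
A note for anyone on a newer pandas: error_bad_lines was deprecated in 
pandas 1.3 and removed in 2.0. The replacement is the on_bad_lines 
argument, which accepts "error", "warn" or "skip". A minimal sketch of the 
same experiment, assuming pandas 1.3 or later is installed:

```python
import io

import pandas as pd

data = "foo,bar\n1,2\n3,4,5\n6,7\n"

# on_bad_lines="skip" silently drops rows with too many fields;
# use "warn" instead to get a message for every skipped line.
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(df)
```

The row 3,4,5 is dropped, leaving the rows 1,2 and 6,7.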