slow loop?

maney at pobox.com maney at pobox.com
Thu Jan 16 10:44:52 EST 2003


Brandon Beck <bbeck at nospam.austin.rr.com> wrote:
> Seems like you're trying to write a CSV parser.  If so, I strongly 
> suggest you get the Object Craft CSV module.  It handles all of the 

OTOH, if you want a Python implementation for basic CSV, here's what I
did after finding the one (several?) pure-Python version(s) listed in
the Vaults had glitches.  The performance is good enough that on a
P2/233 it doesn't annoy me when processing a file with about 7000
rather long lines (150 or so characters, two dozen or so fields).  I
have nothing against fast C implementations, but sometimes you want to
be able to share the program with others without requiring them to muck
about with installing other stuff.

It's designed to be used in what seems like an obvious way despite my
previous work having used a parser that wanted to massage the entire
input file at once (to be fair, it was trying to guess the style of
CSV-like thing you had, and generally did a fair job of it):

import csv

for line in someFile:
    fields = csv.split(line)
    ...
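
A sketch of how the loop above might cope with malformed lines.  Per the
docstring in the source that follows, CSVError carries three strings
(description, successfully parsed part, unparsed tail).  The names
split_all and toy_split are hypothetical and not part of csv.py;
toy_split is only a stand-in so the example runs by itself, and the
"except ... as" syntax assumes a current Python rather than a 2003 one.

```python
class CSVError(Exception):
    """Stand-in for the CSVError class defined in csv.py."""


def split_all(lines, split):
    """Apply split to each line and return (rows, problems).

    problems holds (line_number, description, parsed, tail) tuples built
    from the three strings the exception carries.
    """
    rows, problems = [], []
    for lineno, line in enumerate(lines, 1):
        try:
            rows.append(split(line))
        except CSVError as err:
            description, parsed, tail = err.args
            problems.append((lineno, description, parsed, tail))
    return rows, problems


def toy_split(line):
    # Hypothetical stand-in for csv.split so the sketch is self-contained:
    # it simply rejects any line that contains a double quote.
    k = line.find('"')
    if k >= 0:
        raise CSVError('quotes not supported by toy_split',
                       line[:k], line[k:])
    return line.rstrip('\n').split(',')


rows, problems = split_all(['a,b\n', 'c,"d\n'], toy_split)
for lineno, description, parsed, tail in problems:
    print('line %d: %s (stopped after %r)' % (lineno, description, parsed))
```

Passing the real csv.split in place of toy_split should give the same
shape of result, assuming it raises CSVError with those three arguments.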



# $Id: csv.py,v 1.3 2003/01/13 21:00:28 maney Exp $

#import string
#import re

class CSVError(Exception):
    pass


def split(s):
    """\
    Split the argument string s into a list of strings, one element for each
    CSV-formatted field in s.  This simple version recognizes only
    comma-separated fields with double-quotes as optional delimiters; the
    usual hack of using '""' within a quoted field as an escaped double-
    quote is supported.  The input string may include a terminating newline,
    but it need not do so.

    On error, a CSVError exception is raised.  It carries three strings
    with it: a description of the error, the portion of the input string
    that has been processed successfully, and the unprocessed tail that
    contains the error.
    """
    res = []
    i = 0
    start = i
    end = i
    n = len(s)
    if n > 0 and s[-1] == '\n':
        n = n - 1

    while 1:
        #
        # the current character is the start of a field: either a quoted field ...
        #
        if i < n and s[i] == '"':
            i += 1
            start = i                               # start is first data char
            end = -1                                # end < start: not found yet
            while i < n:
                j = s.find('"', i)
                if j < 0:                           # oops, no quote found
                    break
                if j + 1 < n and s[j + 1] == '"':   # doubled quote: pass it
                    i = j + 2
                else:                               # must be the closing quote
                    i = j + 1
                    end = j
                    break  
            if end < start:
                raise CSVError('ill-formed field: no closing quote',
                               s[:start-1], s[start-1:])
            field = '"'.join(s[start:end].split('""'))
        #
        # ... or an unquoted field
        #
        else:
            start = i
            j = s.find(',', i)
            if j >= 0:
                i = j
            else:
                i = n
            field = s[start:i]
        #
        # append field to result list, see if there's another to parse
        #
        res.append(field)

        if n <= i:
            break

        if s[i] == ',':
            i += 1   
        elif i < n:
            raise CSVError('ill-formed line: start of field not found',
                           s[:i], s[i:])

    return res
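
The quoted-field branch is the densest part of split, so here is a
minimal standalone sketch of the same scan and of the doubled-quote
collapse.  The helper names find_closing_quote and unescape are
hypothetical (split does this work inline), and the sketch passes n to
str.find as an end bound, which the original does not do.

```python
def find_closing_quote(s, i, n):
    """Return the index of the quote that closes a quoted field whose
    first data character is s[i], or -1 if the field is never closed.
    Only s[:n] is examined."""
    while i < n:
        j = s.find('"', i, n)
        if j < 0:                            # no quote left: unterminated
            return -1
        if j + 1 < n and s[j + 1] == '"':    # doubled quote: skip both
            i = j + 2
        else:                                # lone quote closes the field
            return j
    return -1


def unescape(body):
    """Collapse each '""' pair in a quoted field body to a single '"'."""
    return '"'.join(body.split('""'))


line = '"say ""hi""",next'
end = find_closing_quote(line, 1, len(line))
print(end)                       # 11
print(unescape(line[1:end]))     # say "hi"
```

A return value of -1 corresponds to the "no closing quote" error case
in split.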




