CSV module: incorrectly parsed file.

Paul McGuire ptmcg at austin.rr.com
Sun Feb 17 21:57:52 EST 2008


On Feb 17, 8:09 pm, Christopher Barrington-Leigh
<christophe... at gmail.com> wrote:
> Here is a file "test.csv"
> number,name,description,value
> 1,"wer","tape 2"",5
> 1,vvv,"hoohaa",2
>
> I want to convert it to tab-separated without those silly quotes. Note
> in the second line that a field is 'tape 2"' , ie two inches: there is
> a double quote in the string.
>

To disambiguate this data, accept a closing quote only if it is
followed by a comma or the end of the line.  Pyparsing lets you define
your own quoted-string format, so here is one solution.  At the end,
you can extract the data by field name and print it out however you
choose:

data = """\
number,name,description,value
1,"wer","tape 2"",5
1,vvv,"hoohaa",2"""


from pyparsing import *

# very special definition of a quoted string, that ends with a " only
# if followed by a , or the end of line
quotedString = ('"' +
    ZeroOrMore(CharsNotIn('"')|('"' + ~FollowedBy(','|lineEnd))) +
    '"')
quotedString.setParseAction(keepOriginalText, removeQuotes)
integer = Word(nums).setParseAction(lambda toks:int(toks[0]))
value = integer | quotedString | Word(printables.replace(",",""))

# first pass, just parse the comma-separated values
for line in data.splitlines():
    print delimitedList(value).parseString(line)
print

# now second pass, assign field names using names from first line
names = data.splitlines()[0].split(',')
def setValueNames(tokens):
    for k,v in zip(names,tokens):
        tokens[k] = v
lineDef = delimitedList(value).setParseAction(setValueNames)

# parse each line, and extract data by field name
for line in data.splitlines()[1:]:
    results = lineDef.parseString(line)
    print "Desc:", results.description
    print results.dump()


Prints:
['number', 'name', 'description', 'value']
[1, 'wer', 'tape 2"', 5]
[1, 'vvv', 'hoohaa', 2]

Desc: tape 2"
[1, 'wer', 'tape 2"', 5]
- description: tape 2"
- name: wer
- number: 1
- value : 5
Desc: hoohaa
[1, 'vvv', 'hoohaa', 2]
- description: hoohaa
- name: vvv
- number: 1
- value : 2
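
The same closing-quote rule (a quote only ends a field when a comma or
the end of the line follows it) can also be sketched with just the
standard library.  This is not the pyparsing solution above, only a
rough equivalent using the re module.  The names FIELD and split_line
are made up for this illustration, and the sketch assumes every field
is either fully quoted or contains no commas:

```python
import re

data = '''number,name,description,value
1,"wer","tape 2"",5
1,vvv,"hoohaa",2'''

# A field is either a quoted string whose closing quote must be followed
# by a comma or the end of the line, or a bare run of non-comma characters.
FIELD = re.compile(r'''
    "(?P<quoted>.*?)"(?=,|$)    # quoted field, lazy up to a "real" closing quote
  | (?P<bare>[^,]*)             # unquoted field (may be empty)
''', re.VERBOSE)

def split_line(line):
    """Split one line into fields using the closing-quote rule (hypothetical helper)."""
    fields = []
    pos = 0
    while True:
        m = FIELD.match(line, pos)
        quoted = m.group('quoted')
        fields.append(quoted if quoted is not None else m.group('bare'))
        pos = m.end()
        if pos >= len(line):
            break
        pos += 1  # step over the comma separator
    return fields

lines = data.splitlines()
names = split_line(lines[0])
# one dict per data row, keyed by the header names
rows = [dict(zip(names, split_line(line))) for line in lines[1:]]

# tab-separated output without the surrounding quotes
for row in rows:
    print("\t".join(row[n] for n in names))
```

Unlike the pyparsing version, this leaves every value as a string, so
the number and value columns would still need int() if you want
integers.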

-- Paul



