CSV reader ignore brackets

Cameron Simpson cs at cskk.id.au
Tue Sep 24 19:09:02 EDT 2019


On 24Sep2019 15:55, Mihir Kothari <mihir.kothari at gmail.com> wrote:
>I am using python 3.4. I have a CSV file as below:
>
>ABC,PQR,(TEST1,TEST2)
>FQW,RTE,MDE

Really? No quotes around the (TEST1,TEST2) column value? I would have 
said this is invalid data, but that does not help you.

>Basically comma-separated rows, where some rows have a data in column which
>is array like i.e. in brackets.
>So I need to read the file and treat such columns as one i.e. do not
>separate based on comma if it is inside the bracket.
>
>In short I need to read a CSV file where separator inside the brackets
>needs to be ignored.
>
>Output:
>Column:   1       2                3
>Row1:    ABC  PQR  (TEST1,TEST2)
>Row2:    FQW  RTE  MDE
>
>Can you please help with the snippet?

I would be reaching for a regular expression. If you partition your 
values into 2 types: those starting and ending in a bracket, and those 
not, you could write a regular expression for the former:

    \([^)]*\)

which matches a string like (.....) (with, importantly, no embedded 
brackets, only those at the beginning and end).

And you can write a regular expression like:

    [^,]*

for a value containing no commas i.e. all the other values.

Test the bracketed one first, because the second one always matches 
something (possibly the empty string).
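
To see why the order matters, here is a small sketch that runs both 
expressions against the start of a bracketed value. The sample string is 
made up for illustration, based on your example data:

```python
import re

bracketed_re = re.compile(r'\([^)]*\)')
no_commas_re = re.compile(r'[^,]*')

text = '(TEST1,TEST2),MDE'   # illustrative sample, not from a real file

# The no-commas pattern also matches at offset 0, but stops at the first comma.
print(no_commas_re.match(text).group())        # (TEST1

# The bracketed pattern consumes the whole bracketed value.
print(bracketed_re.match(text).group())        # (TEST1,TEST2)

# On a plain value the bracketed pattern does not match at all.
print(bracketed_re.match('MDE'))               # None

# The no-commas pattern succeeds even on an empty field.
print(repr(no_commas_re.match(',x').group()))  # ''
```

If you tried the no-commas pattern first, the bracketed value would be 
cut in half at its embedded comma.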

Then you would not use the CSV module (which expects better formed data 
than you have) and instead write a simple parser for a line of text 
which tries to match one of these two expressions repeatedly to consume 
the line. Something like this (UNTESTED):

    import re

    bracketed_re = re.compile(r'\([^)]*\)')
    no_commas_re = re.compile(r'[^,]*')

    def split_line(line):
      line = line.rstrip()  # drop trailing whitespace/newline
      fields = []
      offset = 0
      while offset < len(line):
        m = bracketed_re.match(line, offset)
        if m:
          field = m.group()
        else:
          m = no_commas_re.match(line, offset)   # this always matches
          field = m.group()
        fields.append(field)
        offset += len(field)
        if line.startswith(',', offset):
          # another column
          offset += 1
        elif offset < len(line):
          raise ValueError(
            "incomplete parse at offset %d, line=%r" % (offset, line))
      return fields

Then read the lines of the file and split them into fields:

    rows = []
    with open(datafilename) as f:
      for line in f:
        fields = split_line(line)
        rows.append(fields)

So basically you're writing a little parser. If you have nested brackets 
things get harder.
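
For the nested case, one possible approach (a different technique from 
the regular expressions above) is to walk the line character by 
character and keep a bracket depth counter, splitting only on commas seen 
at depth zero. This is a minimal sketch, and split_nested is a 
hypothetical name of my own. It assumes only round brackets matter and 
that there is no quoting or escaping in the data:

```python
def split_nested(line):
    """Split on commas that are not inside round brackets.

    Hypothetical helper: assumes no quoting or escaping in the data.
    """
    line = line.rstrip()  # drop trailing whitespace/newline
    fields = []
    depth = 0   # how many '(' are currently open
    start = 0   # offset where the current field begins
    for i, ch in enumerate(line):
        if ch == '(':
            depth += 1
        elif ch == ')':
            if depth == 0:
                raise ValueError(
                    "unbalanced ')' at offset %d, line=%r" % (i, line))
            depth -= 1
        elif ch == ',' and depth == 0:
            # a separator outside any brackets ends the current field
            fields.append(line[start:i])
            start = i + 1
    if depth != 0:
        raise ValueError("unclosed '(' in line=%r" % (line,))
    fields.append(line[start:])  # the last field (may be empty)
    return fields

print(split_nested('ABC,PQR,(TEST1,TEST2)'))  # ['ABC', 'PQR', '(TEST1,TEST2)']
print(split_nested('A,(B,(C,D)),E'))          # ['A', '(B,(C,D))', 'E']
```

Note that this behaves a little differently from the regular expression 
version: it honours brackets anywhere inside a field, not just around the 
whole value, and an empty line yields a single empty field.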

Cheers,
Cameron Simpson <cs at cskk.id.au>


