Regular expression query

Jussi Piitulainen jussi.piitulainen at helsinki.fi
Sun Mar 12 13:24:56 EDT 2017


rahulrasal at gmail.com writes:

> Hi All,
>
> I have a string which looks like
>
> aaaaa,bbbbb,ccccc "4873898374", ddddd, eeeeee "3343,23,23,5,,5,45", fffff "5546,3434,345,34,34,5,34,543,7"
>
> It is a comma-separated string, but some of the fields have a double
> quoted string as part of it (and that double quoted string can have
> commas).  The above string has only 6 fields. First is aaaaa, second is
> bbbbb and last is fffff "5546,3434,345,34,34,5,34,543,7".  How can I
> split this string into its fields using a regular expression? Or if
> there is any other way to do this, please speak out.

If you have any control over the source of this data, try to change the
source so that it writes proper CSV. Then you can use the csv module to
parse the data.
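For comparison, here is a minimal sketch of what "proper CSV" could look
like, assuming the producer can be changed to write through csv.writer.
The field values are the ones from the example string and serve only as
sample data; the names fields, buffer, line and parsed are made up for
the illustration.

```python
import csv
import io

# Sample data only: the six fields from the example string.
fields = ['aaaaa', 'bbbbb', 'ccccc "4873898374"', 'ddddd',
          'eeeeee "3343,23,23,5,,5,45"',
          'fffff "5546,3434,345,34,34,5,34,543,7"']

# The writer quotes every field that contains a comma or a doublequote
# and doubles the embedded doublequotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
line = buffer.getvalue()

# The reader then recovers exactly the six original fields.
parsed = next(csv.reader(io.StringIO(line)))
```

Written this way, each field that contains a doublequote is wrapped in
doublequotes as a whole, which is the form csv.reader expects.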

As it is, csv.reader failed me. Perhaps someone else knows how it should
be parameterized to deal with this?

len(next(csv.reader(io.StringIO(s)))) == 20
len(next(csv.reader(io.StringIO(s), doublequote=False))) == 20
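To reproduce this, a self-contained sketch follows. It assumes that s
holds the example string exactly as quoted above. As far as I can tell,
csv.reader treats a doublequote as a quoting character only when it is
the first character of a field, so the quotes in the middle of these
fields are kept as ordinary characters and every comma splits.

```python
import csv
import io

# Assumption: s is the example string from the question.
s = ('aaaaa,bbbbb,ccccc "4873898374", ddddd, '
     'eeeeee "3343,23,23,5,,5,45", '
     'fffff "5546,3434,345,34,34,5,34,543,7"')

row = next(csv.reader(io.StringIO(s)))

# Every comma acts as a separator, including those inside the
# doublequoted parts, so the row has 20 pieces instead of 6.
pieces_of_fifth_field = row[4:11]
```

Here pieces_of_fifth_field shows how the eeeeee field gets broken up:
it starts with ' eeeeee "3343' and ends with '45"'.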

Here's a regex solution that assumes that there is something in a field
before the doublequoted part, then at most one doublequoted part and
nothing after the doublequoted part.

len(re.findall(r'([^",]+(?:"[^"]*")?)', s)) == 6

re.findall(r'([^",]+(?:"[^"]*")?)', s)
['aaaaa',
'bbbbb',
'ccccc "4873898374"',
' ddddd',
' eeeeee "3343,23,23,5,,5,45"',
' fffff "5546,3434,345,34,34,5,34,543,7"']
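The same pattern can be spelled out piece by piece with re.VERBOSE, as
in the sketch below. The names field and fields are made up for the
illustration, and the strip() call that drops the leading spaces is an
addition of this sketch, not part of the pattern above.

```python
import re

# Assumption: s is the example string from the question.
s = ('aaaaa,bbbbb,ccccc "4873898374", ddddd, '
     'eeeeee "3343,23,23,5,,5,45", '
     'fffff "5546,3434,345,34,34,5,34,543,7"')

field = re.compile(r'''
    [^",]+              # something that is neither doublequote nor comma
    (?: " [^"]* " )?    # then at most one doublequoted part, commas allowed
''', re.VERBOSE)

# No capturing group is needed here: findall returns the whole matches.
fields = [m.strip() for m in field.findall(s)]
```

Dropping the outer capturing group is safe with re.findall but not
with re.split, which returns the matched text only for capturing groups.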

The outermost parentheses in the pattern make the whole pattern a
capturing group. They are redundant above (with re.findall) but
important in the following alternative (with re.split).

re.split(r'([^",]+(?:"[^"]*")?)', s)
['', 'aaaaa',
',', 'bbbbb',
',', 'ccccc "4873898374"',
',', ' ddddd',
',', ' eeeeee "3343,23,23,5,,5,45"',
',', ' fffff "5546,3434,345,34,34,5,34,543,7"',
'']

This splits the string with the pattern that matches the actual data.
With the capturing group it also returns the actual data. One could then
check that the assumptions hold and every other value is just a comma.

I would make that check and throw an exception on failure.
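A minimal sketch of such a check, assuming that a ValueError is an
acceptable way to report the failure. The names FIELD and split_fields
are made up for the illustration.

```python
import re

FIELD = re.compile(r'([^",]+(?:"[^"]*")?)')

def split_fields(line):
    """Return the fields of line, or raise ValueError if the
    assumptions about its structure do not hold."""
    parts = FIELD.split(line)
    # With the capturing group, the odd positions hold the data and
    # the even positions hold whatever was between the matches.
    separators, fields = parts[0::2], parts[1::2]
    # Nothing before the first field, nothing after the last one,
    # and exactly one comma between any two fields.
    expected = [''] + [','] * (len(fields) - 1) + ['']
    if not fields or separators != expected:
        raise ValueError('unexpected structure in line: %r' % (line,))
    return fields
```

With this, an empty field as in 'a,,b' or text after the doublequoted
part as in 'a "x" y,b' raises the exception instead of being silently
split in some other way.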


