Yet another "split string by spaces preserving single quotes" problem

Mon May 14 20:23:50 EDT 2012

On 05/13/12 16:14, Massi wrote:
> Hi everyone,
> I know this question has been asked thousands of times, but in my case
> I have an additional requirement to be satisfied. I need to handle
> substrings in the form 'string with spaces':'another string with
> spaces' as a single token; I mean, if I have this string:
> 
> s = "This is a 'simple test':'string which' shows 'exactly my' problem"
> 
> I need to split it as follows (the single quotes must be maintained in
> the split list):

The "quotes must be maintained" bit is what makes this different
from most common use-cases.  Without that condition, shlex.split()
from the standard library does everything else that you need (it
groups the quoted pieces correctly, but strips the quote characters).
Alternatively, one might try hacking csv.reader() to do the splitting
for you, though I had less luck with that than with shlex.
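
To illustrate the shlex point, here is a minimal sketch (Python 3
syntax, using the sample string from the original question) of what
shlex.split() does with its default POSIX rules.  It keeps each quoted
run together but drops the quote characters themselves:

```python
import shlex

# Sample string from the original question.
s = "This is a 'simple test':'string which' shows 'exactly my' problem"

# Default (POSIX-mode) splitting: quoted runs stay together,
# but the quote characters are removed from the resulting tokens.
tokens = shlex.split(s)
print(tokens)
# ['This', 'is', 'a', 'simple test:string which', 'shows', 'exactly my', 'problem']
```

So the grouping is right, but the quotes the original poster wants to
keep are gone, which is why a regex is used below instead.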

> Up to now I have written some ugly code which uses a regular
> expression:
> 
> splitter = re.compile("(?=\s|^)('[^']+') | ('[^']+')(?=\s|$)")

You might try

 import re
 r = re.compile(r"""(?:'[^']*'|"[^"]*"|[^'" ]+)+""")
 print(r.findall(s))

which seems to match your desired output.  It doesn't currently
handle tabs, but breaking it out into named pieces makes it easy to
modify (and may help in understanding what it's doing):

>>> single_quoted = "'[^']*'"
>>> double_quoted = '"[^"]*"'
>>> other = """[^'" \t]+"""  # added a "\t" tab here
>>> matches = '|'.join((single_quoted, double_quoted, other))
>>> regex = r'(?:%s)+' % matches
>>> r = re.compile(regex)
>>> r.findall(s)
['This', 'is', 'a', "'simple test':'string which'", 'shows',
"'exactly my'", 'problem']

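For reuse, the same pattern could be wrapped in a small function.
This is only a sketch under a few assumptions: the name
split_keep_quotes is made up for illustration, it uses Python 3
syntax, and it swaps the explicit space/tab character class for \s so
that any whitespace (including newlines) separates tokens.

```python
import re

# Same three pieces as in the session above, except that "\s" replaces
# the explicit " \t" (assumption: any whitespace separates tokens).
SINGLE_QUOTED = r"'[^']*'"
DOUBLE_QUOTED = r'"[^"]*"'
OTHER = r"""[^'"\s]+"""
TOKEN_RE = re.compile(r"(?:%s)+" % "|".join((SINGLE_QUOTED, DOUBLE_QUOTED, OTHER)))


def split_keep_quotes(text):
    """Split on whitespace, keeping quoted runs (quotes included) intact."""
    return TOKEN_RE.findall(text)


s = "This is a 'simple test':'string which' shows 'exactly my' problem"
print(split_keep_quotes(s))
```

One caveat: an unbalanced quote is not reported as an error by this
pattern.  The stray quote character is simply skipped, so "it's" comes
back as ['it', 's'].  Whether that is acceptable depends on your input.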
Hope this helps,

-tkc