Split string but ignore quotes

MRAB python at mrabarnett.plus.com
Tue Sep 29 11:56:00 EDT 2009


Björn Lindqvist wrote:
> 2009/9/29 Scooter <slbentley at gmail.com>:
>> I'm attempting to reformat an apache log file that was written with a
>> custom output format. I'm attempting to get it to w3c format using a
>> python script. The problem I'm having is the field-to-field matching.
>> In my python code I'm using split with spaces as my delimiter. But it
>> fails when it reaches the user agent because that field itself
>> contains spaces. But that user agent is enclosed with double quotes.
>> So is there a way to split on a certain delimiter but not to split
>> within quoted words.
>>
>> i.e. a line might look like
>>
>> 2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0;
>> Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC
>> 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200
>> 1923 1360 31715 -
> 
> Try shlex:
> 
> >>> import shlex
> >>> s = '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200'
> >>> shlex.split(s)
> ['2009-09-29', '12:00:00', '-', 'GET', '/', 'Mozilla/4.0 (compatible;
> MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media
> Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)',
> 'http://somehost.com', '200']
> 
A regex solution follows. Note that, unlike shlex.split, it keeps the
double quotes around the quoted field:

 >>> import re
 >>> s = '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200'
 >>> re.findall(r'".*?"|\S+', s)
['2009-09-29', '12:00:00', '-', 'GET', '/', '"Mozilla/4.0 (compatible; 
MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center 
PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)"', 
'http://somehost.com', '200']
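
If the surrounding quotes are unwanted, one way to drop them is to strip
them after matching. This is only a sketch: the name split_fields is made
up for illustration, the sample line is shortened, and it assumes quoted
fields never contain escaped or embedded double quotes.

```python
import re
import shlex

# Same pattern as above: a double-quoted run, or a run of non-whitespace.
TOKEN_RE = re.compile(r'".*?"|\S+')

def split_fields(line):
    """Split on whitespace, keeping double-quoted runs as single fields.

    Hypothetical helper; strips one pair of enclosing double quotes.
    """
    fields = []
    for token in TOKEN_RE.findall(line):
        if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
            token = token[1:-1]  # drop the enclosing quotes
        fields.append(token)
    return fields

# Shortened sample line (assumption: same layout as the log line above).
line = ('2009-09-29 12:00:00 - GET / '
        '"Mozilla/4.0 (compatible; MSIE 7.0)" http://somehost.com 200')
fields = split_fields(line)
print(fields)

# For a simple line like this one the result should agree with shlex.
print(fields == shlex.split(line))
```

The two approaches can differ on other input: shlex also handles
backslash escapes and single quotes, while the regex only knows about
plain double-quoted runs.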


