how to fast processing one million strings to remove quotes

Peter Otten __peter__ at web.de
Fri Aug 4 02:52:00 EDT 2017


Tim Daneliuk wrote:

> On 08/02/2017 10:05 AM, Daiyue Weng wrote:
>> Hi, I am trying to remove extra quotes from a large set of strings (a
>> list of strings), so each original string looks like,
>> 
>> """str_value1"",""str_value2"",""str_value3"",1,""str_value4"""
>> 
>> 
>> I'd like to remove the start and end quotes and the extra pairs of
>> quotes around each string value, so the result will look like,
>> 
>> "str_value1","str_value2","str_value3",1,"str_value4"
> 
> <SNIP>
> 
> This part can also be done fairly efficiently with sed:
> 
> time cat hugequote.txt | sed 's/"""/"/g;s/""/"/g' >/dev/null
> 
> real    0m2.660s
> user    0m2.635s
> sys     0m0.055s
> 
> hugequote.txt is a file with 1M copies of your test string above in it.
> 
> Run on a quad core i5 on FreeBSD 10.3-STABLE.

It looks like Python is fairly competitive:

$ wc -l hugequote.txt 
1000000 hugequote.txt
$ cat unquote.py 
import csv

with open("hugequote.txt") as instream:
    for field, in csv.reader(instream):
        print(field)
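The reason the `for field, in ...` unpacking works: in CSV terms, each
input line is a single quoted field whose doubled quotes are escape
sequences for literal quotes, so csv.reader yields one-field rows that
are already unquoted. A small sketch of this, using the sample string
from the original post in place of the file:

```python
import csv
import io

# One line from hugequote.txt: a single quoted CSV field in which each
# "" is the CSV escape for a literal quote character.
line = '"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""'

# csv.reader returns one row containing exactly one field: the
# desired unquoted result.
[row] = csv.reader(io.StringIO(line))
[field] = row
print(field)  # "str_value1","str_value2","str_value3",1,"str_value4"
```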

$ time python3 unquote.py > /dev/null

real    0m3.773s
user    0m3.665s
sys     0m0.082s

$ time cat hugequote.txt | sed 's/"""/"/g;s/""/"/g' > /dev/null

real    0m4.862s
user    0m4.721s
sys     0m0.330s

Run on ancient AMD hardware ;)
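For comparison, the two-stage substitution the sed command performs can
also be sketched in pure Python with str.replace, shown here on the
sample string rather than the file. Like the sed version, it relies on
the fixed quoting pattern of the input rather than real CSV parsing:

```python
# Pure-Python analogue of sed 's/"""/"/g;s/""/"/g': collapse triple
# quotes first, then the remaining doubled quotes.
line = '"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""'
cleaned = line.replace('"""', '"').replace('""', '"')
print(cleaned)  # "str_value1","str_value2","str_value3",1,"str_value4"
```

Note that, unlike the csv.reader approach, this blind replacement would
misbehave if a field value itself contained quote characters.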
