how to quickly process one million strings to remove quotes

MRAB python at mrabarnett.plus.com
Wed Aug 2 13:05:52 EDT 2017


On 2017-08-02 16:05, Daiyue Weng wrote:
> Hi, I am trying to remove extra quotes from a large set of strings (a
> list of strings), so each original string looks like,
> 
> """str_value1"",""str_value2"",""str_value3"",1,""str_value4"""
> 
> 
> I'd like to remove the start and end quotes and the extra pairs of
> quotes around each string value, so the result will look like,
> 
> "str_value1","str_value2","str_value3",1,"str_value4"
> 
> 
> and then join each string by a new line.
> 
> I have tried the following code,
> 
> for line in str_lines[1:]:
>     strip_start_end_quotes = line[1:-1]
>     splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
>     str_lines[str_lines.index(line)] = splited_line_rem_quotes
> 
> for_pandas_new_headers_str = '\n'.join(str_lines)
> 
> but it is really slow (running for ages) when the list contains over 1
> million lines. I am looking for a faster way to do this.
> 
[snip]

The problem is the line:

     str_lines[str_lines.index(line)]

It does a linear search through str_lines until it finds a match for 
the line.

To find the 10th line it must search through the first 10 lines.

To find the 100th line it must search through the first 100 lines.

To find the 1000th line it must search through the first 1000 lines.

And so on.

In Big-O notation, the performance is O(n**2).
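
Here is a minimal timing sketch of just the .index() pattern (the 
sample data and sizes are made up for illustration; exact times depend 
on the machine), showing that doubling the input roughly quadruples 
the running time:

import time

def quote_fix_with_index(lines):
    # Mirrors the original approach: list.index() rescans the list
    # from the start on every iteration, giving O(n**2) behaviour.
    work = list(lines)
    for line in list(work):
        work[work.index(line)] = line.replace('""', '"')
    return work

for n in (5_000, 10_000):
    data = ['""a"",""b"",1'] * n
    start = time.perf_counter()
    quote_fix_with_index(data)
    print(n, 'lines:', round(time.perf_counter() - start, 2), 'seconds')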

The Pythonic way of doing it is to put the results into a new list:


new_str_lines = str_lines[:1]  # keep the first (header) line unchanged

for line in str_lines[1:]:
    strip_start_end_quotes = line[1:-1]  # drop the outer quotes
    splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')  # collapse doubled quotes
    new_str_lines.append(splited_line_rem_quotes)  # append is amortized O(1)


In Big-O notation, the performance is O(n).
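
For what it's worth, the O(n) loop can also be written as a list 
comprehension, finishing with the join the original code was building 
toward (a sketch assuming, as the original code does, that the first 
line is a header to keep unchanged):

new_str_lines = str_lines[:1] + [
    line[1:-1].replace('""', '"')  # strip outer quotes, collapse doubled ones
    for line in str_lines[1:]
]
for_pandas_new_headers_str = '\n'.join(new_str_lines)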


