how to quickly process one million strings to remove quotes

Nick Mellor thebalancepro at gmail.com
Wed Aug 2 23:16:02 EDT 2017


On Thursday, 3 August 2017 01:05:57 UTC+10, Daiyue Weng  wrote:
> Hi, I am trying to remove extra quotes from a large set of strings (a
> list of strings), so each original string looks like:
> 
> """str_value1"",""str_value2"",""str_value3"",1,""str_value4"""
> 
> 
> I'd like to remove the start and end quotes and the extra pairs of quotes
> around each string value, so the result will look like:
> 
> "str_value1","str_value2","str_value3",1,"str_value4"
> 
> 
> and then join the strings with newlines.
> 
> I have tried the following code,
> 
> for line in str_lines[1:]:
>     strip_start_end_quotes = line[1:-1]
>     splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
>     str_lines[str_lines.index(line)] = splited_line_rem_quotes
> 
> for_pandas_new_headers_str = '\n'.join(str_lines)
> 
> but it is really slow (it runs for ages) when the list contains over 1
> million lines. I am looking for a faster way to do this.
> 
> I also tried to multiprocess this task with
> 
> def preprocess_data_str_line(data_str_lines):
>     """
> 
>     :param data_str_lines:
>     :return:
>     """
>     for line in data_str_lines:
>         strip_start_end_quotes = line[1:-1]
>         splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
>         data_str_lines[data_str_lines.index(line)] = splited_line_rem_quotes
> 
>     return data_str_lines
> 
> 
> def multi_process_prepcocess_data_str(data_str_lines):
>     """
> 
>     :param data_str_lines:
>     :return:
>     """
>     # if cpu load < 25% and 4GB of ram free use 3 cores
>     # if cpu load < 70% and 4GB of ram free use 2 cores
>     cores_to_use = how_many_core()
> 
>     data_str_blocks = slice_list(data_str_lines, cores_to_use)
> 
>     for block in data_str_blocks:
>         # spawn a process for each data string block assigned to a cpu core
>         p = multiprocessing.Process(target=preprocess_data_str_line,
>                                     args=(block,))
>         p.start()
> 
> but I don't know how to concatenate the results back into the list so that
> I can join the strings in the list by new lines.
> 
> So, ideally, I am thinking about using multiprocessing + a fast function to
> preprocess each line to speed up the whole process.
> 
> cheers

Hi Daiyue,

My first thought is to use split/join to solve this problem, but you would need to decide what to do with any non-string elements in your 1,000,000-element list. You also need to be sure that the pipe character | does not appear in any of your strings.

split_on_dbl_dbl_quote = '|'.join(original_list).split('""')
remove_dbl_dbl_quotes_and_outer_quotes = ''.join(split_on_dbl_dbl_quote[::2]).split('|')

You need to be sure of your data: [::2] (which returns just the even-numbered elements) relies on every double-double-quote both opening and closing within the same string.

This runs in under a second for a million strings but does affect *all* elements, not just strings. The non-strings would become strings after the second statement.
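
For concreteness, here is a minimal sketch of that kind of single-pass, single-threaded approach, written against the example line above. It assumes the data is your str_lines list (each element one whole line), that every line starts and ends with one extra quote and has the inner quotes doubled, and that '|' never appears in the data; the variable names and the two variants are just illustrations, not tested against your real data.

# assumptions: str_lines holds whole lines like
#   '"""str_value1"",""str_value2"",1,""str_value4"""'
# and '|' never occurs in the data

# variant 1: plain per-line comprehension (no .index() lookups, so it stays linear)
cleaned = [line[1:-1].replace('""', '"') for line in str_lines]

# variant 2: join once, transform once, split back (the split/join idea above)
big = '|'.join(str_lines)
big = big.replace('""', '"')                        # collapse the doubled quotes in one pass
cleaned = [line[1:-1] for line in big.split('|')]   # strip the outer quotes

for_pandas_new_headers_str = '\n'.join(cleaned)

Both variants stay linear in the total amount of text, which is what matters at a million lines.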

As to multiprocessing: I would look at well-optimised single-thread solutions like split/join before considering MP. If you can fit the problem to a split/join it'll be much simpler and more "pythonic".
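
If MP does turn out to be necessary, a multiprocessing.Pool is the simplest way to get the per-block results back in order, which is the part your code was missing. A rough sketch (clean_block and clean_all are made-up names, not anything from your code):

import multiprocessing

def clean_block(block):
    # same per-line transform as above, applied to one block of lines
    return [line[1:-1].replace('""', '"') for line in block]

def clean_all(str_lines, workers=4):
    # split the list into roughly equal blocks, one per worker
    size = (len(str_lines) + workers - 1) // workers
    blocks = [str_lines[i:i + size] for i in range(0, len(str_lines), size)]
    # call clean_all() from under `if __name__ == '__main__':` so child
    # processes can import this module safely
    with multiprocessing.Pool(workers) as pool:
        cleaned_blocks = pool.map(clean_block, blocks)   # map() returns the blocks in order
    return '\n'.join(line for block in cleaned_blocks for line in block)

In practice, though, the single-pass string version above will likely finish before the pool has even spawned its workers.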

Cheers,

Nick





