How to quickly process one million strings to remove quotes

Daiyue Weng daiyueweng at gmail.com
Wed Aug 2 11:05:24 EDT 2017


Hi, I am trying to remove extra quotes from a large set of strings (a
list of strings). Each original string looks like this:

"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""


I'd like to remove the start and end quotes and the extra pairs of quotes
around each string value, so that the result looks like this:

"str_value1","str_value2","str_value3",1,"str_value4"


and then join the strings with newlines.
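
For reference, the intended per-line transformation can be checked
directly on the sample line (a minimal sketch; s is just the example
string from above):

s = '"""str_value1"",""str_value2"",""str_value3"",1,""str_value4"""'
print(s[1:-1].replace('""', '"'))
# prints: "str_value1","str_value2","str_value3",1,"str_value4"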

I have tried the following code:

for line in str_lines[1:]:
    strip_start_end_quotes = line[1:-1]
    splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
    str_lines[str_lines.index(line)] = splited_line_rem_quotes

for_pandas_new_headers_str = '\n'.join(str_lines)

but it is really slow (running for ages) when the list contains over 1
million lines. I am looking for a faster way to do this.
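
The main cost here is likely the str_lines.index(line) lookup, which
scans the list from the beginning on every iteration and makes the loop
quadratic in the number of lines. A minimal single-pass sketch of the
same transformation (assuming str_lines holds the raw lines and the
first line is left untouched, as in the loop above):

# keep the first line as-is, clean the rest in one pass
cleaned = str_lines[:1] + [line[1:-1].replace('""', '"')
                           for line in str_lines[1:]]
for_pandas_new_headers_str = '\n'.join(cleaned)

Since str.replace runs in C, this should typically get through a million
short lines in well under a second, with no multiprocessing needed.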

I also tried to parallelise this task with multiprocessing:

def preprocess_data_str_line(data_str_lines):
    """
    Strip the outer quotes and collapse doubled quotes on each line.

    :param data_str_lines: list of raw string lines
    :return: the same list with each line cleaned in place
    """
    for line in data_str_lines:
        strip_start_end_quotes = line[1:-1]
        splited_line_rem_quotes = strip_start_end_quotes.replace('""', '"')
        data_str_lines[data_str_lines.index(line)] = splited_line_rem_quotes

    return data_str_lines


import multiprocessing


def multi_process_preprocess_data_str(data_str_lines):
    """
    Split the list into blocks and clean each block in a separate process.

    :param data_str_lines: list of raw string lines
    :return: None (the spawned processes are neither joined nor collected)
    """
    # if cpu load < 25% and 4GB of ram free use 3 cores
    # if cpu load < 70% and 4GB of ram free use 2 cores
    cores_to_use = how_many_core()

    data_str_blocks = slice_list(data_str_lines, cores_to_use)

    for block in data_str_blocks:
        # spawn a process for each data string block assigned to a cpu core
        p = multiprocessing.Process(target=preprocess_data_str_line,
                                    args=(block,))
        p.start()

but I don't know how to collect the results from the processes back into
a single list so that I can join the strings with newlines.
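
A minimal sketch of one way to do this with multiprocessing.Pool, which
splits the work across processes and collects the results in input order
for you (clean_line is a hypothetical helper applying the same per-line
transformation as above):

import multiprocessing

def clean_line(line):
    # same per-line transformation as in the loop above
    return line[1:-1].replace('""', '"')

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        # map() preserves input order and returns a plain list
        cleaned = pool.map(clean_line, str_lines[1:], chunksize=10000)
    # keep the untouched first line, as in the original loop
    for_pandas_new_headers_str = '\n'.join(str_lines[:1] + cleaned)

Note that for a per-line operation this cheap, the cost of pickling the
lines between processes can easily outweigh the parallel speed-up, so
the single-pass list comprehension above may well be faster.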

So, ideally, I am thinking about combining multiprocessing with a fast
per-line function to speed up the whole process.

cheers


