Python replace multiple strings (m*n) combination

INADA Naoki songofacandy at gmail.com
Sat Feb 25 14:08:51 EST 2017


If you can use third party library, I think you can use Aho-Corasick algorithm.

https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

https://pypi.python.org/pypi/pyahocorasick/

On Sat, Feb 25, 2017 at 3:54 AM,  <kar6308 at gmail.com> wrote:
> I have a task to search for multiple patterns in incoming string and replace with matched patterns, I'm storing all pattern as keys in dict and replacements as values, I'm using regex for compiling all the pattern and using the sub method on pattern object for replacement. But the problem I have a tens of millions of rows, that I need to check for pattern which is about 1000 and this is turns out to be a very expensive operation.
>
> What can be done to optimize it. Also I have special characters for matching, where can I specify raw string combinations.
>
> for example is the search string is not a variable we can say
>
> re.search(r"\$%^search_text", "replace_text", "some_text") but when I read from the dict where shd I place the "r" keyword, unfortunately putting inside key doesnt work "r key" like this....
>
> Pseudo code
>
> for string in genobj_of_million_strings:
>    pattern = re.compile('|'.join(regex_map.keys()))
>    return pattern.sub(lambda x: regex_map[x], string)
> --
> https://mail.python.org/mailman/listinfo/python-list



More information about the Python-list mailing list