Python replace multiple strings (m*n) combination

Cem Karan cfkaran2 at gmail.com
Sat Feb 25 15:13:14 EST 2017


Another possibility is to form a suffix array (https://en.wikipedia.org/wiki/Suffix_array#Applications) as an index for the string, and then search for patterns within the suffix array.  The basic idea is that you index the string you're searching over once, and then look for patterns within it.  
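
To make the indexing idea concrete, here is a minimal sketch of a suffix array lookup. It assumes the patterns are literal strings rather than regexes, and the names build_suffix_array and find_all are made up for this example. The construction is the naive sort-all-suffixes approach, which is only suitable for a demo; a real index would use a faster construction algorithm.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Fine for a demo; production code would use a faster algorithm.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, pattern):
    """Return the sorted start offsets of every occurrence of pattern."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Every suffix in sa[start:lo] begins with pattern.
    return sorted(sa[start:lo])


text = "banana bandana"
sa = build_suffix_array(text)   # index the string once
print(find_all(text, sa, "ana"))  # then query it for each pattern
print(find_all(text, sa, "ban"))
```

The index is built once per string, and each pattern lookup is then a pair of binary searches over it.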

The main problem with this method is how you're doing the replacements.  If your replacement text can create a new string that matches a different regex that occurs later on, then you really should use what INADA Naoki suggested.
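
A small illustration of that concern, as a sketch only. It assumes literal (non-regex) keys and a made-up mapping named replacements. The single-pass version uses one compiled alternation from the standard re module as a stand-in for a single-pass matcher such as Aho-Corasick; it is not the pyahocorasick API. re.escape is used so that special characters in the keys are matched literally.

```python
import re

# Hypothetical mapping of literal search strings to their replacements.
replacements = {"cat": "dog", "dog": "bird"}


def replace_sequentially(text, mapping):
    # One pass per key: later keys also see the output of earlier replacements.
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def replace_single_pass(text, mapping):
    # Longest keys first, so a longer key wins over a shorter key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up in the mapping; replaced text is never rescanned.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(replace_sequentially("cat and dog", replacements))
print(replace_single_pass("cat and dog", replacements))
```

The sequential version turns "cat" into "dog" and then into "bird", while the single pass only ever matches against the original text.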

Thanks,
Cem Karan

On Feb 25, 2017, at 2:08 PM, INADA Naoki <songofacandy at gmail.com> wrote:

> If you can use a third-party library, I think you can use the Aho-Corasick algorithm.
> 
> https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
> 
> https://pypi.python.org/pypi/pyahocorasick/
> 
> On Sat, Feb 25, 2017 at 3:54 AM,  <kar6308 at gmail.com> wrote:
>> I have a task to search for multiple patterns in each incoming string and replace every match with its replacement. I store the patterns as keys in a dict and the replacements as values, compile all the patterns into one regex, and call the sub method on the compiled pattern object. The problem is that I have tens of millions of rows to check against about 1000 patterns, and this turns out to be a very expensive operation.
>> 
>> What can be done to optimize it? Also, I have special characters to match; where can I specify raw strings?
>> 
>> For example, if the search string is not a variable, we can say
>> 
>> re.sub(r"\$%^search_text", "replace_text", "some_text"), but when I read the pattern from the dict, where should I place the "r" prefix? Unfortunately, putting it inside the key, like "r key", doesn't work.
>> 
>> Pseudo code
>> 
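>> import re
>> # Compile the combined pattern once, not on every iteration.
>> pattern = re.compile('|'.join(regex_map.keys()))
>> for string in genobj_of_million_strings:
>>   # sub() passes a match object; this lookup assumes the keys are literal text.
>>   print(pattern.sub(lambda m: regex_map[m.group(0)], string))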
>> --
>> https://mail.python.org/mailman/listinfo/python-list



