doing hundreds of re.subs efficiently on large strings

nihilo exnihilo at NOmyrealCAPSbox.com
Sun Apr 6 20:28:13 EDT 2003


Bengt Richter wrote:
> 
> Thanks for testing, that is interesting. It would be interesting to see
> the test code you ran, so we could see the reasons for .25 vs .43 (and
> I could revise my reality-model as necessary ;-)
> 
> Regards,
> Bengt Richter

Hi Bengt,

Here is some code that I created to test the two versions.

Original:


import time, re
start = time.clock()
data = ''  # Paste the HTML of http://news.google.com here as sample data.
# I was just putting this in the file for testing to avoid time
# complications over the network. I've omitted it here because it is
# more than 50KB.

data = data.replace('US and Britain', 'Oceania')
data = data.replace('US and British', 'Oceanian')
data = data.replace('America and Britain', 'Oceania')
data = data.replace('the U.S.A.', 'Oceania')
data = data.replace('the United States of America', 'Oceania')
data = data.replace('the U.S.', 'Oceania')
data = data.replace('the USA', 'Oceania')
data = data.replace('the US', 'Oceania')
data = data.replace('The US', 'Oceania')
data = data.replace('the United States', 'Oceania')
data = data.replace('U.S.A.', 'Oceania')
data = data.replace('United States of America', 'Oceania')
data = data.replace('U.S.', 'Oceania')
data = data.replace(' US', ' Oceania')
data = data.replace('US ', 'Oceania ')
data = data.replace('US,', 'Oceania,')
data = data.replace('US-', 'Oceania-')
data = data.replace('United States', 'Oceania')
data = data.replace('American', 'Oceanian')
data = data.replace('British', 'Oceanian')
data = data.replace('Great Britain', 'Oceania')
data = data.replace('Britain', 'Oceania')
data = data.replace('United Kingdom', 'Oceania')
data = data.replace('European Union countries', 'Oceania provinces')
data = data.replace('European nations', 'Oceania provinces')
data = data.replace('European', 'Oceanian')
data = data.replace('Europe', 'Oceania')
data = data.replace('the EU', 'Oceania')
data = data.replace('Australian', 'Oceanian')
data = data.replace('Australia', 'Oceania')
data = data.replace('Washington Post', 'Ministry of Truth')
data = data.replace('Washington DC', 'Airstrip One')
data = data.replace('Washington D.C.', 'Airstrip One')
data = data.replace('Washington, D.C.', 'Airstrip One')
data = data.replace('Washington, DC', 'Airstrip One')
data = data.replace('Washington', 'Airstrip One')
data = data.replace('WASHINGTON', 'Airstrip One')
data = data.replace('Boy Scouts', 'Junior Anti-Sex League')
data = data.replace('Girl Scouts', 'Junior Anti-Sex League')
data = data.replace('President Hussein', 'Emmanuel Goldstein, Enemy of the People')
data = data.replace('Saddam Hussein', 'Emmanuel Goldstein')
data = data.replace('Saddam', 'Goldstein')
data = data.replace('Hussein', 'Goldstein')
data = data.replace('President Bush', 'Big Brother')
data = data.replace('president Bush', 'Big Brother')
data = data.replace('George Bush', 'Big Brother')
data = data.replace('George W. Bush', 'Big Brother')
data = data.replace('President George W Bush', 'Big Brother')
data = data.replace('George W Bush', 'Big Brother')
data = data.replace('Mr. Bush', 'Big Brother')
data = data.replace('Bush', 'Big Brother')
data = data.replace('Donald Rumsfeld', 'O\' Brien')
data = data.replace('Rumsfeld', "O' Brien")
data = data.replace('United Nations', 'bloc nations')
data = data.replace('UN', 'bloc nations')
data = data.replace('U.N.', 'bloc nations')
data = data.replace('coalition allies', 'bloc allies')
data = data.replace(' allies', ' bloc allies')
data = data.replace('Japanese', 'Eastasian')
data = data.replace('Chinese', 'Eastasian')
data = data.replace('Japan', 'Eastasia')
data = data.replace('China', 'Eastasia')
data = data.replace('North Korea', 'Eastasia')
data = data.replace('South Korea', 'Eastasia')
data = data.replace('India', 'Eastasia')
data = data.replace('Hong Kong', 'Eastasia')
data = data.replace('Vietnam', 'Eastasia')
data = data.replace('Singapore', 'Eastasia')
data = data.replace('the Phillipines', 'Eastasia')
data = data.replace('Thailand', 'Eastasia')
data = data.replace('Sri Lanka', 'Eastasia')
data = data.replace('Mexico', 'Oceania')
data = data.replace('Canadian', 'Oceanian')
data = data.replace('Canada', 'Oceania')
data = data.replace('Federal Bureau of Investigation', 'Ministry of Love')
data = data.replace('Department of State', 'Ministry of Truth')
data = data.replace('Pentagon', 'Ministry of Peace')
data = data.replace(' Post', ' Ministry of Truth')
data = data.replace('Russian', 'Eastasian')
data = data.replace('Russia', 'Eastasia')
data = data.replace('Australian', 'Oceanian')
data = data.replace('Australia', 'Oceania')
data = data.replace('Christian Science Monitor', 'INGSOC Science Monitor')
data = data.replace('Christianity', 'INGSOC')
data = data.replace('Washington Post', 'Ministry of Truth')
data = data.replace('Washington Post', 'Ministry of Truth')
data = data.replace('Daily Telegraph', 'Ministry of Truth')
data = data.replace('New York Times', 
'Ministry of Truth')
data = data.replace('International Herald Tribune', 'Ministry of Truth')
data = data.replace('International Herald Tribune', 
'Ministry of Truth')
data = data.replace('BBC', 'Ministry of Truth')
data = data.replace('wartime', 'peacetime')
data = data.replace('congressional', 'Inner Party')
data = data.replace('Senate', 'Inner Party Senate')
data = data.replace('House of Representatives', 'Inner Party House')
data = data.replace('House', 'Inner Party House')
data = data.replace('Congress', 'Inner Party')
data = data.replace('Marines', 'Peace Keepers')
data = data.replace('soldiers', 'peace keepers')
data = data.replace('Republican Guards', 'Resistance Elite Troops')
data = data.replace('Republican Guard', 'Resistance Elite Troops')
data = data.replace('Department of Defense', 'Ministry of Peace')
data = data.replace('Department of Homeland Security', 'Ministry of Love')
data = data.replace('Associated Press', 'Ministry of Truth')
data = data.replace('FBI', 'Ministry of Love')
data = data.replace('CNN', 'Ministry of Truth')
data = data.replace('New York Times', 'Ministry of Truth')
data = data.replace('UPI', 'Ministry of Truth')
data = data.replace('United Press International', 'Ministry of Truth')
data = data.replace('Federal Reserve', 'Ministry of Plenty')
data = data.replace('Homeland Security', 'Motherland Security')
data = data.replace('homeland security', 'motherland security')
data = data.replace('protesters', 'resistance thought criminals')
data = data.replace('prisoner', 'Ministry of Love prison guest')
# data = data.replace(' prison', 'Ministry of Love')
data = data.replace('solitary confinement', 'Room 101')
data = data.replace('anti-war', 'Resistance')
data = data.replace('Anti-war', 'Resistance')
data = data.replace('paramilitary', 'Ministry of Peace')
data = data.replace('para-military', 'Ministry of Peace')
data = data.replace('military', 'Ministry of Peace')
data = data.replace('Military', 'Ministry of Peace')
data = data.replace('Gulf War', 'Gulf Peace Conflict')
data = data.replace('news network', 'Ministry of Truth')
data = data.replace('Al-Qaeda', 'Resistance')
data = data.replace('Al Qaeda', 'Resistance')
data = data.replace('Al-Quaida', 'Resistance')
data = data.replace('al Qaeda', 'Resistance')
data = data.replace('Qaeda', 'Resistance')
data = data.replace('Taliban', 'Resistance')
data = data.replace('Taleban', 'Resistance')
data = data.replace("Royal Air Force", 'Ministry of Peace Air Force')
data = data.replace('air force', 'Ministry of Peace Air Force')
data = data.replace('Invasion', 'Liberation')
data = data.replace('invasion', 'liberation')
data = data.replace('\'shock and awe\'', 'awe-shock')
data = data.replace('\"shock and awe\"', 'awe-shock')
data = data.replace('weapons of mass destruction', 'gifts of mass coercion')
data = data.replace('demonstrators', 'thoughtcriminals')
data = data.replace('protestors', 'thoughtcriminals')

print time.clock() - start

And the modified code is:

import time, re
start = time.clock()
data = '' # same data
# Maps each phrase to its replacement (the name shadows the built-in dict type).
dict = {
    'US and Britain': 'Oceania', 'US and British': 'Oceanian',
    'America and Britain': 'Oceania', 'the U.S.A.': 'Oceania',
    'the United States of America': 'Oceania', 'the U.S.': 'Oceania',
    'the USA': 'Oceania', 'the US': 'Oceania', 'The US': 'Oceania',
    'the United States': 'Oceania', 'U.S.A.': 'Oceania',
    'United States of America': 'Oceania', 'U.S.': 'Oceania',
    ' US': ' Oceania', 'US ': 'Oceania ', 'US,': 'Oceania,', 'US-': 'Oceania-',
    'United States': 'Oceania', 'American': 'Oceanian', 'British': 'Oceanian',
    'Great Britain': 'Oceania', 'Britain': 'Oceania',
    'United Kingdom': 'Oceania',
    'European Union countries': 'Oceania provinces',
    'European nations': 'Oceania provinces', 'European': 'Oceanian',
    'Europe': 'Oceania', 'the EU': 'Oceania', 'Australian': 'Oceanian',
    'Australia': 'Oceania', 'Washington Post': 'Ministry of Truth',
    'Washington DC': 'Airstrip One', 'Washington D.C.': 'Airstrip One',
    'Washington, D.C.': 'Airstrip One', 'Washington, DC': 'Airstrip One',
    'Washington': 'Airstrip One', 'WASHINGTON': 'Airstrip One',
    'Boy Scouts': 'Junior Anti-Sex League',
    'Girl Scouts': 'Junior Anti-Sex League',
    'President Hussein': 'Emmanuel Goldstein, Enemy of the People',
    'Saddam Hussein': 'Emmanuel Goldstein', 'Saddam': 'Goldstein',
    'Hussein': 'Goldstein', 'President Bush': 'Big Brother',
    'president Bush': 'Big Brother', 'George Bush': 'Big Brother',
    'George W. Bush': 'Big Brother', 'President George W Bush': 'Big Brother',
    'George W Bush': 'Big Brother', 'Mr. Bush': 'Big Brother',
    'Bush': 'Big Brother', 'Donald Rumsfeld': 'O\' Brien',
    'Rumsfeld': 'O\'Brien', 'United Nations': 'bloc nations',
    'UN': 'bloc nations', 'U.N.': 'bloc nations',
    'coalition allies': 'bloc allies', ' allies': ' bloc allies',
    'Japanese': 'Eastasian', 'Chinese': 'Eastasian', 'Japan': 'Eastasia',
    'China': 'Eastasia', 'North Korea': 'Eastasia', 'South Korea': 'Eastasia',
    'India': 'Eastasia', 'Hong Kong': 'Eastasia', 'Vietnam': 'Eastasia',
    'Singapore': 'Eastasia', 'the Phillipines': 'Eastasia',
    'Thailand': 'Eastasia', 'Sri Lanka': 'Eastasia', 'Mexico': 'Oceania',
    'Canadian': 'Oceanian', 'Canada': 'Oceania',
    'Federal Bureau of Investigation': 'Ministry of Love',
    'Department of State': 'Ministry of Truth',
    'Pentagon': 'Ministry of Peace', ' Post': ' Ministry of Truth',
    'Russian': 'Eastasian', 'Russia': 'Eastasia',
    'Australian': 'Oceanian', 'Australia': 'Oceania',
    'Christian Science Monitor': 'INGSOC Science Monitor',
    'Christianity': 'INGSOC',
    'Washington Post': 'Ministry of Truth',
    'Washington Post': 'Ministry of Truth',
    'Daily Telegraph': 'Ministry of Truth',
    'New York Times': 'Ministry of Truth',
    'International Herald Tribune': 'Ministry of Truth',
    'International Herald Tribune': 'Ministry of Truth',
    'BBC': 'Ministry of Truth', 'wartime': 'peacetime',
    'congressional': 'Inner Party', 'Senate': 'Inner Party Senate',
    'House of Representatives': 'Inner Party House',
    'House': 'Inner Party House', 'Congress': 'Inner Party',
    'Marines': 'Peace Keepers', 'soldiers': 'peace keepers',
    'Republican Guards': 'Resistance Elite Troops',
    'Republican Guard': 'Resistance Elite Troops',
    'Department of Defense': 'Ministry of Peace',
    'Department of Homeland Security': 'Ministry of Love',
    'Associated Press': 'Ministry of Truth', 'FBI': 'Ministry of Love',
    'CNN': 'Ministry of Truth', 'New York Times': 'Ministry of Truth',
    'UPI': 'Ministry of Truth',
    'United Press International': 'Ministry of Truth',
    'Federal Reserve': 'Ministry of Plenty',
    'Homeland Security': 'Motherland Security',
    'homeland security': 'motherland security',
    'protesters': 'resistance thought criminals',
    'prisoner': 'Ministry of Love prison guest',
    'solitary confinement': 'Room 101', 'anti-war': 'Resistance',
    'Anti-war': 'Resistance', 'paramilitary': 'Ministry of Peace',
    'para-military': 'Ministry of Peace', 'military': 'Ministry of Peace',
    'Military': 'Ministry of Peace', 'Gulf War': 'Gulf Peace Conflict',
    'news network': 'Ministry of Truth', 'Al-Qaeda': 'Resistance',
    'Al Qaeda': 'Resistance', 'Al-Quaida': 'Resistance',
    'al Qaeda': 'Resistance', 'Qaeda': 'Resistance', 'Taliban': 'Resistance',
    'Taleban': 'Resistance', 'Royal Air Force': 'Ministry of Peace Air Force',
    'air force': 'Ministry of Peace Air Force', 'Invasion': 'Liberation',
    'invasion': 'liberation', '\'shock and awe\'': 'awe-shock',
    '\"shock and awe\"': 'awe-shock',
    'weapons of mass destruction': 'gifts of mass coercion',
    'demonstrators': 'thoughtcriminals', 'protestors': 'thoughtcriminals',
}
# One alternation over all the phrases; adjacent string literals are
# concatenated, so the pattern can span several source lines.
pat = re.compile(
    '(Canada|the U\.S\.A\.|air force|International Herald Tribune|'
    'Associated Press|prisoner|Great Britain|George W Bush|US and British|'
    'soldiers|Federal Bureau of Investigation|Europe|demonstrators|'
    'Australia|Congress|Saddam|the U\.S\.|Boy Scouts|solitary confinement|'
    'Taliban|Russian|Thailand|military|Hong Kong|congressional|protestors|'
    'Department of Defense|House of Representatives|'
    'Christian Science Monitor|Canadian|President Bush|the US|'
    'European nations|BBC|protesters|Daily Telegraph|Washington|American|'
    'invasion|Invasion|Military|Marines| US|Rumsfeld|U\.N\.|Gulf War|'
    'paramilitary|North Korea|coalition allies|House|para-military|US |'
    'India|\\\'shock and awe\\\'|Senate|George W\. Bush|'
    '\\\"shock and awe\\\"|Al Qaeda|South Korea|Pentagon|'
    'President George W Bush|US,| allies|Washington|New York Times|wartime|'
    'Christianity|Taleban|Bush|New York Times|Al-Quaida|homeland security|'
    'Washington Post|Washington DC|Australian| Post|Mr\. Bush|'
    'United Nations|President Hussein|the United States|Homeland Security|'
    'Federal Reserve|America and Britain|Singapore|president Bush|'
    'the Phillipines|United States of America|Britain|Donald Rumsfeld|'
    'Republican Guards|Department of Homeland Security|Qaeda|China|'
    'al Qaeda|European|Department of State|Washington Post|the EU|'
    'George Bush|International Herald Tribune|UPI|WASHINGTON|U\.S\.|'
    'news network|European Union countries|FBI|British|Royal Air Force|'
    'United States|United Press International|Vietnam|Republican Guard|'
    'anti-war|CNN|Anti-war|Hussein|Russia|Al-Qaeda|Washington D\.C\.|'
    'Saddam Hussein|the United States of America|Chinese|Mexico|'
    'Girl Scouts|U\.S\.A\.|US-|the USA|Japanese|United Kingdom|'
    'US and Britain|UN|The US|Sri Lanka|Japan|weapons of mass destruction)')
# Because the pattern is one capturing group, split() returns the matched
# phrases at the odd indices, between the untouched pieces of text.
strings = pat.split(data)
for i in xrange(1, len(strings), 2):
    strings[i] = dict[strings[i]]
data = ''.join(strings)

print time.clock() - start
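
A note on the hand-written alternation above: Python's re module tries
alternatives left to right, so as far as I can tell 'Europe' (listed
before 'European') and 'Washington' (listed before 'Washington Post')
win at those positions and the longer phrases never match. The sketch
below is one way to avoid that, not the code that was timed in this
post: it builds the pattern from the table with re.escape, longest
phrase first, and substitutes in one pass with a callback. It is written
for Python 3, and the names replacements, build_pattern and replace_all
plus the shortened table are made up for illustration.

```python
import re

# Hypothetical, shortened table; the real one is the full mapping in the post.
replacements = {
    'Europe': 'Oceania',
    'European': 'Oceanian',
    'Washington': 'Airstrip One',
    'Washington Post': 'Ministry of Truth',
    'U.S.': 'Oceania',
}

def build_pattern(mapping):
    """Compile one alternation over all keys, longest key first."""
    # Longest first, so 'European' is tried before its prefix 'Europe'.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape takes care of the dots in keys such as 'U.S.'.
    return re.compile('|'.join(re.escape(k) for k in keys))

def replace_all(text, mapping, pattern=None):
    """Replace every key found in text with its mapped value, in one pass."""
    if pattern is None:
        pattern = build_pattern(mapping)
    return pattern.sub(lambda m: mapping[m.group(0)], text)

sample = 'The Washington Post covers European and U.S. news.'
print(replace_all(sample, replacements))
# prints: The Ministry of Truth covers Oceanian and Oceania news.
```

Unlike the chain of data.replace calls, a single pass never rescans text
it has already replaced, so the two approaches can give different output
when one replacement contains another phrase from the table.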

I just re-ran both versions: the first averaged .052 seconds and the
second .080 seconds. The second would obviously be a lot faster if I
pre-compiled the pattern and cPickled it, but I wanted to include
compilation time in the test. The fastest version that I have tested
uses Alex's replacement suggestion in combination with pre-compilation
and cPickle, but the naive chain of data = data.replace(...) calls is
surprisingly fast.
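
For anyone repeating the comparison on a current Python, here is a rough
sketch that times compilation and substitution separately. It is not the
script I ran: time.clock() was removed in Python 3.8, so it uses
time.perf_counter() instead, and the sample text, the four-entry table
and the helper names (chained, single_pass, timed) are stand-ins chosen
for illustration. The numbers it prints will depend on the machine and
on the data.

```python
import re
import time

# Hypothetical stand-ins for the ~50KB page and the full replacement table.
data = 'The US and Britain met European nations in Washington. ' * 1000
pairs = [
    ('US and Britain', 'Oceania'),
    ('European nations', 'Oceania provinces'),
    ('European', 'Oceanian'),
    ('Washington', 'Airstrip One'),
]
table = dict(pairs)
# Pattern source: escaped keys, longest first.
source = '|'.join(re.escape(k) for k in sorted(table, key=len, reverse=True))

def chained(text):
    """First version: one str.replace call per phrase, in order."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text

def single_pass(text, pattern):
    """Second version: one regex pass with a lookup per match."""
    return pattern.sub(lambda m: table[m.group(0)], text)

def timed(func, *args):
    """Return (elapsed seconds, result) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result

t_replace, out_replace = timed(chained, data)
re.purge()  # clear re's internal cache so the compile is really measured
t_compile, pattern = timed(re.compile, source)
t_sub, out_sub = timed(single_pass, data, pattern)

print('chained replace: %.4f s' % t_replace)
print('compile: %.4f s, substitute: %.4f s' % (t_compile, t_sub))
```

Reporting the compile time on its own makes it easier to see how much of
the second version's cost is paid once and how much on every run.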

cheers,

nihilo




