a text processing problem: are regexpressions necessary?

Joonas Paalasmaa joonas at olen.to
Sun Mar 17 15:44:02 EST 2002


Sandy Norton wrote:
> 
> Hi,
> 
> I thought I'd share this problem which has just confronted me.
> 
> The problem
> 
> An automatic way to transform urls of articles on various news sites to
> their printer-friendly counterparts is complicated by the fact that
> different sites have different schemes for doing this (see examples
> below).
> 
> Now given two examples for each site: a regular link to an article and
> its printer-friendly counterpart, is there a way to automatically
> generate transformation code that is specific to each site, but which
> generalizes across all article urls within that site?
> 
> Here are a few examples from several online publications:
> 
> http://news.bbc.co.uk/hi/english/world/africa/newsid_1871000/1871611.stm
> http://news.bbc.co.uk/low/english/world/africa/newsid_1871000/1871611.stm
> 
> http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688
> http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688
> 
> http://www.nationalreview.com/ponnuru/ponnuru031502.shtml
> http://www.nationalreview.com/ponnuru/ponnuruprint031502.html
> 
> http://www.thenation.com/doc.mhtml?i=20020204&s=said
> http://www.thenation.com/docPrint.mhtml?i=20020204&s=said
> 
> I'm kinda heading in the direction of attempting to generate regular
> expressions for each site... But I'm a bit apprehensive about doing
> this. Is there a more pythonic way to approach this problem?

I would create a dictionary where the beginnings of the urls are the keys.
The values of the dictionary would be tuples that contain the string to be
replaced and its replacement.

replDict = {"http://www.economist.com/agenda/":
                ("displayStory.cfm", "PrinterFriendly.cfm"),
            "http://www.thenation.com":
                ("doc.mhtml", "docPrint.mhtml")}

Then find which key the url starts with and do the replacement with
string.replace.
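
For example, something along these lines (just a rough sketch; the helper
name printer_friendly and the choice to return the url unchanged when no
prefix matches are my own):

def printer_friendly(url, repl_dict):
    # Find the site prefix, if any, that the url starts with.
    for prefix, (old, new) in repl_dict.items():
        if url.startswith(prefix):
            # The str.replace method does the same job as string.replace.
            return url.replace(old, new)
    # Unknown site: return the url unchanged.
    return url

url = "http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688"
print(printer_friendly(url, replDict))
# -> http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688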


