a text processing problem: are regexpressions necessary?
Joonas Paalasmaa
joonas at olen.to
Sun Mar 17 15:44:02 EST 2002
Sandy Norton wrote:
>
> Hi,
>
> I thought I'd share this problem which has just confronted me.
>
> The problem
>
> An automatic way to transform URLs of articles on various news sites to
> their printer-friendly counterparts is complicated by the fact that
> different sites have different schemes for doing this. (see examples
> below)
>
> Now given two examples for each site: a regular link to an article and
> its printer-friendly counterpart, is there a way to automatically
> generate transformation code that is specific to each site, but which
> generalizes across all article urls within that site?
>
> Here are a few examples from several online publications:
>
> http://news.bbc.co.uk/hi/english/world/africa/newsid_1871000/1871611.stm
> http://news.bbc.co.uk/low/english/world/africa/newsid_1871000/1871611.stm
>
> http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688
> http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688
>
> http://www.nationalreview.com/ponnuru/ponnuru031502.shtml
> http://www.nationalreview.com/ponnuru/ponnuruprint031502.html
>
> http://www.thenation.com/doc.mhtml?i=20020204&s=said
> http://www.thenation.com/docPrint.mhtml?i=20020204&s=said
>
> I'm kinda heading in the direction of attempting to generate regular
> expressions for each site... But I'm a bit apprehensive about doing
> this. Is there a more pythonic way to approach this problem?
I would create a dictionary where the beginnings of the URLs are the keys.
The values of the dictionary would be tuples containing the string to be
replaced and its replacement.

replDict = {"http://www.economist.com/agenda/":
                ("displayStory.cfm", "PrinterFriendly.cfm"),
            "http://www.thenation.com/":
                ("doc.mhtml", "docPrint.mhtml")}
Then check which key matches the URL and do the replacement with
string.replace.
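Put together, a minimal sketch of that approach might look like this
(the helper name to_printer_friendly is illustrative, and the dictionary
entries are taken from the example URLs above):

```python
# Map a URL prefix to a (substring, replacement) pair.
replDict = {
    "http://www.economist.com/agenda/":
        ("displayStory.cfm", "PrinterFriendly.cfm"),
    "http://www.thenation.com/":
        ("doc.mhtml", "docPrint.mhtml"),
}

def to_printer_friendly(url):
    """Return the printer-friendly URL, or the URL unchanged
    if no known site prefix matches."""
    for prefix, (old, new) in replDict.items():
        if url.startswith(prefix):
            return url.replace(old, new)
    return url

print(to_printer_friendly(
    "http://www.thenation.com/doc.mhtml?i=20020204&s=said"))
```

New sites can then be supported by adding one dictionary entry,
without touching the lookup code.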