a text processing problem: are regexpressions necessary?

William Park opengeometry at NOSPAM.yahoo.ca
Sun Mar 17 13:57:26 EST 2002


Sandy Norton <sandskyfly at hotmail.com> wrote:
> Hi,
> 
> I thought I'd share this problem which has just confronted me.
> 
> The problem
> 
> An automatic way to transform urls to articles on various news sites to
> their printer-friendly counterparts is complicated by the fact that
> different sites have different schemes for doing this. (see examples
> below)
> 
> Now given two examples for each site: a regular link to an article and
> its printer-friendly counterpart, is there a way to automatically
> generate transformation code that is specific to each site, but which
> generalizes across all article urls within that site?

You already identified the central issue.  Each site is different, but
presumably consistent within the site.  I guess you can build up Sed
scripts...
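
To see what a rule has to change, a rough first step is to strip the
common prefix and suffix from the two example URLs and look at what is
left over.  A minimal Python sketch follows; the name differing_part is
my own invention, and it only reports the difference, it does not
generate a working rule.

```python
import os.path


def differing_part(regular, printable):
    """Return (prefix, old, new, suffix) for one example pair of URLs."""
    # os.path.commonprefix compares character by character,
    # not path component by path component.
    prefix = os.path.commonprefix([regular, printable])
    rest_a = regular[len(prefix):]
    rest_b = printable[len(prefix):]
    # Common suffix = common prefix of the reversed remainders.
    suffix = os.path.commonprefix([rest_a[::-1], rest_b[::-1]])[::-1]
    old = rest_a[:len(rest_a) - len(suffix)]
    new = rest_b[:len(rest_b) - len(suffix)]
    return prefix, old, new, suffix


if __name__ == "__main__":
    print(differing_part(
        "http://news.bbc.co.uk/hi/english/world/africa/newsid_1871000/1871611.stm",
        "http://news.bbc.co.uk/low/english/world/africa/newsid_1871000/1871611.stm",
    ))
```

For the BBC pair the leftover is "hi" versus "low", which maps directly
onto a substitution.  For the National Review pair it is "031502.s"
versus "print031502.", which does not generalize to other articles, so
a hand-written pattern is still needed there.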

> http://news.bbc.co.uk/hi/english/world/africa/newsid_1871000/1871611.stm
> http://news.bbc.co.uk/low/english/world/africa/newsid_1871000/1871611.stm

sed -e 's,http://news.bbc.co.uk/hi/,http://news.bbc.co.uk/low/,'

> http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688
> http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688

grep 'http://www.economist.com/' | sed -e 's,[^/]*\.cfm?,PrinterFriendly.cfm?,'

> http://www.nationalreview.com/ponnuru/ponnuru031502.shtml
> http://www.nationalreview.com/ponnuru/ponnuruprint031502.html

grep 'http://www.nationalreview.com/' | sed -e 's,\([0-9]\+\)\.shtml$,print\1.html,'

> http://www.thenation.com/doc.mhtml?i=20020204&s=said
> http://www.thenation.com/docPrint.mhtml?i=20020204&s=said

grep 'http://www.thenation.com/' | sed -e 's,\.mhtml?,Print&,'
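
Since this is the Python list: the same per-site rules can be kept in
one table and applied with re.sub.  This is only a sketch under my own
naming (RULES, printer_friendly); the patterns are the four sed rules
above translated to Python syntax, where \? is a literal question mark.

```python
import re

# (site prefix, pattern, replacement), one entry per site.
RULES = [
    ("http://news.bbc.co.uk/",
     r"^http://news\.bbc\.co\.uk/hi/", "http://news.bbc.co.uk/low/"),
    ("http://www.economist.com/",
     r"[^/]*\.cfm\?", "PrinterFriendly.cfm?"),
    ("http://www.nationalreview.com/",
     r"([0-9]+)\.shtml$", r"print\1.html"),
    ("http://www.thenation.com/",
     r"\.mhtml\?", r"Print\g<0>"),  # \g<0> is the whole match, like sed's &
]


def printer_friendly(url):
    """Apply the first rule whose site prefix matches; else return url unchanged."""
    for prefix, pattern, replacement in RULES:
        if url.startswith(prefix):
            return re.sub(pattern, replacement, url, count=1)
    return url


if __name__ == "__main__":
    print(printer_friendly(
        "http://www.nationalreview.com/ponnuru/ponnuru031502.shtml"))
```

Unlike the grep | sed pipelines, this passes URLs from unknown sites
through unchanged instead of dropping them.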

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin


