a text processing problem: are regexpressions necessary?

Sun Mar 17 15:06:29 EST 2002

On 17 Mar 2002 05:29:18 -0800, sandskyfly at hotmail.com (Sandy Norton) wrote:

>Hi,
>
>I thought I'd share this problem which has just confronted me.
>
>The problem
>
>An automatic way to tranform urls to articles on various news sites to
>their printerfriendly counterparts is complicated by the fact that
>different sites have different schemes for doing this. (see examples
>below)
> 
>Now given two examples for each site: a regular link to an article and
>its printer-friendly counterpart, is there a way to automatically
>generate transformation code that is specific to each site, but which
>generalizes across all article urls within that site?
> 
>Here are a few examples from several online publications:
>
>http://news.bbc.co.uk/hi/english/world/africa/newsid_1871000/1871611.stm
>http://news.bbc.co.uk/low/english/world/africa/newsid_1871000/1871611.stm
>    
>http://www.economist.com/agenda/displayStory.cfm?Story_ID=1043688
>http://www.economist.com/agenda/PrinterFriendly.cfm?Story_ID=1043688
>    
>http://www.nationalreview.com/ponnuru/ponnuru031502.shtml
>http://www.nationalreview.com/ponnuru/ponnuruprint031502.html
>    
>http://www.thenation.com/doc.mhtml?i=20020204&s=said
>http://www.thenation.com/docPrint.mhtml?i=20020204&s=said
>
>I'm kinda heading in the direction of attempting to generate regular
>expressions for each site... But I'm a bit apprehensive about doing
>this. Is there a more pythonic way to approach this problem?
>
>Any advice would be appreciated.
>
My approach would be to write a function that accepts an example
pair of urls as above and returns a functor that will transform
the display version to the print version. A functor is a callable
instance of a class, and a class instance can contain data as well
as the method (__call__) that gets called when you use the instance
as a function. Thus you can store rules, regexes or whatever necessary
to do the job in the class instance, once your function has those
generated. E.g., your function could pass the rules it determined to the
class constructor and then return the instance. Of course you design
your functor class to support what you need for the above.

You might want to put the functors in a dictionary according to site
domain (key modified if need be to distinguish different rules generated
for different parts of the same site). E.g., if you have disp_url and print_url,
with one rule per site, you could store it something like

    site_dom = disp_url.split('/')[2] # assuming "http://site_dom/..."
    funcdict[site_dom] = get_url_rewrite_functor(disp_url, print_url)

and then later use the stored rules to convert some new display url by

    site_dom = new_disp_url.split('/')[2]
    new_print_url = funcdict[site_dom](new_disp_url)

Now it's just a matter of writing get_url_rewrite_functor ;-)

To analyze the differences between urls, I think I would start by
splitting them with re.compile(r'([/?])') and then deal with
the differing pieces one by one. I might split those further
by numbers and periods (re.compile(r'([\d.]+)') and/or scan them
for matching prefix and suffix sequences. It will probably be easier
to store rules as a list of (fcn,args..) tuples to operate on selected
items of the first level split than generating a single glorious regex
substitution, but OTOH it might not be that bad. You'll also
have to decide whether to treat url elements that match in examples
as constants to verify or variables to pass through. Some are more
obvious than others. For a start you can probably pass everything
through that you don't generate a change rule for, but it's good to
flag violations of your assumptions.

Your best final approach may depend on what your performance requirements
are, but you're probably better of optimizing later, after you get some
play time with the algorithms.

BTW, give your functor class a __str__ method to return a printable
representation of its rules in some form, to help debugging.

Anyway, that's how I'd start out. YMMV.
HTH,

Regards,
Bengt Richter