Program inefficiency?

Pablo Ziliani pablo at decode.com.ar
Sat Sep 29 19:32:54 EDT 2007


thebjorn wrote:
> On Sep 29, 7:55 pm, Pablo Ziliani <pa... at decode.com.ar> wrote:
>   
>> thebjorn wrote:
>>     
>>> Ugh, that was entirely too many regexps for my taste :-)
>> Oh yeah, now it's clear as mud.
>>     
>
> I'm anxiously awaiting your beacon of clarity ;-)
>   

Admittedly, that was a bit arrogant on my part. Sorry.

>> I do think that the whole program shouldn't take more than 10 lines of
>> code
>>     
>
> Well, my mass_replace above is 10 lines, and the actual replacement
> code is a one liner. Perhaps you'd care to illustrate how you'd
> shorten that while still keeping it "clear"?
>   

I don't think the relevant code was only those 10 lines, but well, you 
have already answered the other question yourself in a subsequent 
post (thanks for saving me a lot of time).
I think that "clear" is a compromise between code legibility (most of 
what you sacrifice by using regexes) and overall code length. Even regexes 
can be legible enough when they are well documented, not to mention the 
fact that they are an idiom common to various languages.
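
To illustrate what "well documented" could look like, here is a minimal
sketch using Python's re.VERBOSE flag. The pattern and the sample tag are
hypothetical, made up for illustration; they are not taken from the
original program.

```python
import re

# Hypothetical pattern, for illustration only: an href attribute quoted
# with single or double quotes, in any letter case.
HREF_RE = re.compile(r"""
    \b href \s* = \s*      # attribute name and equals sign
    (?P<quote>["'])        # opening quote, single or double
    (?P<value>.*?)         # attribute value, as short as possible
    (?P=quote)             # the same quote character closes the value
    """, re.VERBOSE | re.IGNORECASE)

match = HREF_RE.search('<A HREF="#some anchor">text</A>')
print(match.group("value"))  # prints: #some anchor
```

In verbose mode, whitespace in the pattern is ignored and "#" starts a
comment, so each piece of the regex can carry its own explanation.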

>> using one sensible regex
>>     
>
> I have no doubt that it would be possible to do with a single regex.
> Whether it would be sensible or not is another matter entirely...
>   

Putting it in those terms, I completely agree with you (that's why I 
suggested letting e.g. BeautifulSoup deal with them). But by "sensible" 
I meant something different, inherent to the regex itself.
For instance, I don't think I need to explain to you why this is not 
sensible: (href=|HREF=)+(.*)(#)+(.*)(\w\'\?-<:)+(.*)(">)+
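
For anyone following along, one reading of the problem (an assumption on
my side about what the pattern was meant to do): the fifth group looks
like it was meant to be a character class, but written with parentheses
it requires that exact run of characters. A quick check, with made-up
input strings:

```python
import re

# The pattern exactly as quoted above.
pattern = re.compile(r'''(href=|HREF=)+(.*)(#)+(.*)(\w\'\?-<:)+(.*)(">)+''')

# Hypothetical input: an ordinary link with a fragment.
ordinary = '<a href="page.html#some anchor">text</a>'
print(pattern.search(ordinary))  # prints: None

# Hypothetical input containing the literal run  x'?-<:  after the "#".
contrived = '''<a href="#x'?-<:">text</a>'''
print(pattern.search(contrived) is not None)  # prints: True
```

The greedy (.*) groups and the "+" after single-alternative groups add
backtracking without adding meaning.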


>   
>> (impossible to define without knowing the real input and output formats).
>>     
>
> Of course, but I don't think you can guess too terribly wrong. My
> version handles upper and lower case attributes, quoting with single
> (') and double (") quotes, and any number of spaces in attribute
> values. It maintains all other text as-is, and converts spaces to
> underscores in href and name attributes. Did I get anything majorly
> wrong?
>   

Well, you spent some time interpreting his code. No doubt you are smart, 
but being a lazy person (and not proud of it, unlike other people who say 
the same), I prefer leaving that part to the interested party.


>   
>> And (sorry to tell) I'm convinced this is a problem for regexes, in
>> spite of anybody's personal taste.
>>     
>
> Well, let's see it then :-)

IMO, your second example proves it well enough.

FWIW I made some changes to your code (see attached), because it wasn't 
taking into account the tag name (<a>), and the names of the attributes 
(href, name) can appear in other tags as well, which is a problem. It 
still doesn't solve the problem of one tag having both attributes with 
spaces (which can be easily fixed with a second regex, but that was out 
of the question :P), and there can be a lot of other problems (both because 
I'm far from being an expert in regexes and because I only tested it 
against the given string), but it should provide at least some guidance.
I also made it match the id of the target anchor, since a fragment can 
point to either its name or its id, depending on the doctype.
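
For readers without the attachment, the idea can be sketched roughly as
below. This is not the attached fixurls.py, only a minimal illustration
that takes the two-regex route mentioned above. The names A_TAG_RE,
ATTR_RE and fix_anchors are hypothetical, and it assumes that attribute
values are quoted and contain no ">".

```python
import re

# Assumption: an <a ...> opening tag with no ">" inside attribute values.
A_TAG_RE = re.compile(r"<a\b[^>]*>", re.IGNORECASE)

# An href, name or id attribute with a single- or double-quoted value.
ATTR_RE = re.compile(
    r"""(?P<head>\b(?:href|name|id)\s*=\s*)(?P<quote>["'])(?P<value>.*?)(?P=quote)""",
    re.IGNORECASE,
)

def _fix_attr(match):
    # Replace spaces with underscores in the value, keep the rest as-is.
    quote = match.group("quote")
    value = match.group("value").replace(" ", "_")
    return match.group("head") + quote + value + quote

def fix_anchors(html):
    # First regex finds <a> tags, second one rewrites attributes inside each.
    return A_TAG_RE.sub(lambda m: ATTR_RE.sub(_fix_attr, m.group(0)), html)

sample = '<a href="#my anchor">x</a> <p name="a b">y z</p>'
print(fix_anchors(sample))
```

Only the <a> tag is rewritten here; the name attribute on the <p> tag and
all text outside tags are left untouched.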


Regards,
Pablo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fixurls.py
Type: text/x-python
Size: 1339 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20070929/dcc1b73d/attachment.py>

