how to avoid leading white spaces

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jun 6 11:29:23 EDT 2011


On Sun, 05 Jun 2011 23:03:39 -0700, rurpy at yahoo.com wrote:

> Thus what starts as
>   if line.startswith ('CUSTOMER '):
>     try: 
>       kw, first_initial, last_name, code, rest = line.split(None, 4)
>       ...
> often turns into (sometimes before it is written) something like
>   m = re.match (r'CUSTOMER (\w+) (\w+) ([A-Z]\d{3})', line)
>   if m:
>     first_initial, last_name, code = m.group(...)


I would argue that the first, non-regex solution is superior, as it 
clearly distinguishes the multiple steps of the solution:

* filter lines that start with "CUSTOMER"
* extract fields in that line
* validate fields (not shown in your code snippet)

while the regex tries to do all of these in a single command. This makes 
the regex an "all or nothing" solution: it matches *everything* or 
*nothing*. This means that your opportunity for giving meaningful error 
messages is much reduced. E.g. I'd like to give an error message like:

    found digit in customer name (field 2)

but with your regex, if it fails to match, I have no idea why it failed, 
so I can't give any more meaningful error than:

    invalid customer line

and leave it to the caller to determine what makes it invalid. (Did I 
misspell "CUSTOMER"? Put a dot after the initial? Forget the code? Use 
two spaces between fields instead of one?)
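
To make that concrete, here's roughly the sort of thing I have in mind. 
The field names and the particular validation rules are only my guesses, 
since I don't know your actual data:

def parse_customer(line):
    if not line.startswith('CUSTOMER '):
        return None  # not a customer line; the caller can skip it
    try:
        kw, first_initial, last_name, code, rest = line.split(None, 4)
    except ValueError:
        raise ValueError('too few fields in customer line')
    # Each field gets its own check, so the error message can say
    # exactly what was wrong and where.
    if not first_initial.isalpha():
        raise ValueError('found non-letter in customer initial (field 1)')
    if not last_name.isalpha():
        raise ValueError('found digit in customer name (field 2)')
    if not (len(code) == 4 and code[0].isupper() and code[1:].isdigit()):
        raise ValueError('bad customer code (field 3)')
    return first_initial, last_name, code, rest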



[...]
> I would expect
> any regex processor to compile the regex into an FSM.

Flying Spaghetti Monster?

I have been Touched by His Noodly Appendage!!!


[...]
>> The fact that creating a whole new string to split on is faster than
>> *running* the regex (never mind compiling it, loading the regex engine,
>> and anything else that needs to be done) should tell you which does
>> more work. Copying is cheap. Parsing is expensive.
> 
> In addition to being wrong (loading is done once, compilation is
> typically done once or a few times, while the regex is used many times
> inside a loop so the overhead cost is usually trivial compared with the
> cost of starting Python or reading a file), this is another
> micro-optimization argument.

Yes, but you still have to pay the cost of loading the re engine. Even 
if it is a one-off cost, it's still a cost, and sometimes (not always!) 
it can be significant. It's quite hard to write fast, tiny Python 
scripts, because
the initialization costs of the Python environment are so high. (Not as 
high as for, say, VB or Java, but much higher than, say, shell scripts.) 
In a tiny script, you may be better off avoiding regexes because it takes 
longer to load the engine than to run the rest of your script!

But yes, you are right that this is a micro-optimization argument. In a 
big application, it's less likely to be important.


> I'm not sure why you've suddenly developed this obsession with wringing
> every last nanosecond out of your code.  Usually it is not necessary. 
> Have you thought of buying a faster computer? Or using C?  *wink*

It's hardly an obsession. I'm just stating it as a relevant factor: for 
simple text parsing tasks, string methods are often *much* faster than 
regexes.
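
If anyone wants to check this for themselves, something like the 
following will do. I make no claims about what numbers you'll see on 
your machine, so I won't quote any:

import timeit

setup = "line = 'CUSTOMER J Smith A123 some more data'"

# Simple string-method test: filter plus split.
string_time = timeit.timeit(
    "line.startswith('CUSTOMER ') and line.split(None, 4)",
    setup=setup, number=1000000)

# Pre-compiled regex, so compilation isn't counted inside the loop.
regex_time = timeit.timeit(
    "pat.match(line)",
    setup=setup + "; import re; "
          "pat = re.compile(r'CUSTOMER (\\w+) (\\w+) ([A-Z]\\d{3})')",
    number=1000000)

print("string methods:", string_time)
print("compiled regex:", regex_time)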


[...]
>> Ah, but if your requirements are complicated enough that it takes you
>> ten minutes and six lines of string method calls, that sounds to me
>> like a situation that probably calls for a regex!
> 
> Recall that the post that started this discussion presented a problem
> that took me six lines of code (actually spread out over a few more for
> readability) to do without regexes versus one line with.
> 
> So you do agree that that a regex was a better solution in that case?

I don't know... I'm afraid I can't find your six lines of code, and so 
can't judge it in comparison to your regex solution:

for line in f:
    fixed = re.sub (r"(TABLE='\S+)\s+'$", r"\1'", line)

My solution would probably be something like this:

for line in lines:
    if line.endswith("'"):
        line = line[:-1].rstrip() + "'"

although perhaps I've misunderstood the requirements.
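
Just to make the comparison concrete, here are both versions applied to 
a made-up sample line (I'm guessing at what the real data looks like):

import re

line = "SOMETHING TABLE='CUSTOMERS    '"

# The regex version quoted above:
fixed_re = re.sub(r"(TABLE='\S+)\s+'$", r"\1'", line)

# My string-method version:
fixed_str = line
if line.endswith("'"):
    fixed_str = line[:-1].rstrip() + "'"

print(fixed_re)   # SOMETHING TABLE='CUSTOMERS'
print(fixed_str)  # SOMETHING TABLE='CUSTOMERS'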


[...]
>>> (Note that "Apocalypse" is referring to a series of Perl design
>>> documents and has nothing to do with regexes in particular.)
>>
>> But Apocalypse 5 specifically has everything to do with regexes. That's
>> why I linked to that, and not (say) Apocalypse 2.
> 
> Where did I suggest that you should have linked to Apocalypse 2? I wrote
> what I wrote to point out that the "Apocalypse" title was not a
> pejorative comment on regexes.  I don't see how I could have been
> clearer.

Possibly by saying what you just said here?

I never suggested, or implied, or thought, that "Apocalypse" was a 
pejorative comment on *regexes*. The fact that I referenced Apocalypse 
FIVE suggests strongly that there are at least four others, presumably 
not about regexes.


[...]
>> It is only relevant in so far as the readability and relative
>> obfuscation of regex syntax is relevant. No further.
> 
> OK, again you are confirming it is only the syntax of regexes that
> bothers you?

The syntax of regexes is a big part of it. I won't say the only part.


[...]
>> If regexes were more readable, as proposed by Wall, that would go a
>> long way to reducing my suspicion of them.
> 
> I am delighted to read that you find the new syntax more acceptable.

Perhaps I wasn't as clear as I could have been. I don't know what the new 
syntax is. I was referring to the design principle of improving the 
readability of regexes. Whether Wall's new syntax actually does improve 
readability and ease of maintenance is a separate issue, one on which I 
don't have an opinion. I applaud his *intention* to reform regex 
syntax, without necessarily agreeing that he has done so.



-- 
Steven


