how to avoid leading white spaces

Tue Jun 7 12:00:34 EDT 2011

On 06/06/2011 09:29 AM, Steven D'Aprano wrote:
> On Sun, 05 Jun 2011 23:03:39 -0700, rurpy at yahoo.com wrote:
[...]
> I would argue that the first, non-regex solution is superior, as it
> clearly distinguishes the multiple steps of the solution:
>
> * filter lines that start with "CUSTOMER"
> * extract fields in that line
> * validate fields (not shown in your code snippet)
>
> while the regex tries to do all of these in a single command. This makes
> the regex an "all or nothing" solution: it matches *everything* or
> *nothing*. This means that your opportunity for giving meaningful error
> messages is much reduced. E.g. I'd like to give an error message like:
>
>     found digit in customer name (field 2)
>
> but with your regex, if it fails to match, I have no idea why it failed,
> so can't give any more meaningful error than:
>
>     invalid customer line
>
> and leave it to the caller to determine what makes it invalid. (Did I
> misspell "CUSTOMER"? Put a dot after the initial? Forget the code? Use
> two spaces between fields instead of one?)

I agree that is a legitimate criticism.  Its importance depends
greatly on the purpose and consumers of the code.  While such
detailed error messages might be appropriate in a fully polished
product, in my case, I often have to process files personally
to extract information, or to provide code to others (who typically
have at least some degree of technical sophistication) to do the
same.

In this case, being able to code something quickly, and adapt it
quickly to changes is more important than providing highly detailed
error messages.  The format is simple enough that "invalid customer
line" plus the line number is perfectly adequate.  YMMV.
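
For illustration, a minimal sketch of that style of error reporting.
The line format ("CUSTOMER <name> <initial> <code>", single spaces)
and the names CUSTOMER_RE and parse_customers are assumptions made up
for this example, not the actual format or code under discussion:

```python
import re

# Hypothetical format: CUSTOMER, a name (letters only), a one-letter
# initial, and a numeric code, separated by single spaces.
CUSTOMER_RE = re.compile(r"^CUSTOMER ([A-Za-z]+) ([A-Z]) (\d+)$")

def parse_customers(lines):
    """Return (records, errors) for the CUSTOMER lines in `lines`."""
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.startswith("CUSTOMER"):
            continue  # not a customer line; ignore it
        m = CUSTOMER_RE.match(line)
        if m is None:
            # All-or-nothing match: we only know the line is bad,
            # not which field is at fault.
            errors.append("line %d: invalid customer line" % lineno)
        else:
            records.append(m.groups())
    return records, errors
```

The regex does the filtering of malformed lines, field extraction and
validation in one step; the price is the generic message.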

As I said, regexes are a tool and, like any tool, should be used
appropriately.

[...]
>> In addition to being wrong (loading is done once, compilation is
>> typically done once or a few times, while the regex is used many times
>> inside a loop so the overhead cost is usually trivial compared with the
>> cost of starting Python or reading a file), this is another
>> micro-optimization argument.
>
> Yes, but you have to pay the cost of loading the re engine, even if it is
> a one off cost, it's still a cost,

~$ time python -c 'pass'
real	0m0.015s
user	0m0.011s
sys	0m0.003s

~$ time python -c 'import re'
real	0m0.015s
user	0m0.011s
sys	0m0.003s

Or do you mean something else by "loading the re engine"?

> and sometimes (not always!) it can be
> significant. It's quite hard to write fast, tiny Python scripts, because
> the initialization costs of the Python environment are so high. (Not as
> high as for, say, VB or Java, but much higher than, say, shell scripts.)
> In a tiny script, you may be better off avoiding regexes because it takes
> longer to load the engine than to run the rest of your script!

Do you have an example?  I am having a hard time imagining that.
Perhaps you are thinking of the time required to compile an RE?

~$ time python -c 'import re; re.compile(r"^[^()]*(\([^()]*\)[^()]*)*$")'
real	0m0.017s
user	0m0.014s
sys	0m0.003s

Hard to imagine a case where 15 ms is fast enough but
17 ms is too slow.  And that's without the diluting effect
of actually doing some real work in the script.  Of course
a more complex regex would likely take longer.

(The times vary greatly on my machine; I am quoting the most
common low results, not the absolute lowest.)
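
Because whole-process timings are noisy, the compile cost can also be
measured in isolation.  A rough sketch using the standard timeit
module; the helper names compile_uncached and best_time are invented
for this example, and re.purge() is called so each run really
recompiles instead of hitting the re module's internal cache:

```python
import re
import timeit

# Same pattern as in the shell command above.
PATTERN = r"^[^()]*(\([^()]*\)[^()]*)*$"

def compile_uncached():
    """Compile PATTERN after clearing re's cache of compiled patterns."""
    re.purge()
    return re.compile(PATTERN)

def best_time(func, repeat=5, number=100):
    """Best-of-`repeat` average time, in seconds, for one call to func."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

if __name__ == "__main__":
    print("compile: %.6f s per call" % best_time(compile_uncached))
```

Taking the minimum over several repeats filters out most of the
machine noise; the figure will still differ between machines and
Python versions.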

>>>> (Note that "Apocalypse" is referring to a series of Perl design
>>>> documents and has nothing to do with regexes in particular.)
>>>
>>> But Apocalypse 5 specifically has everything to do with regexes. That's
>>> why I linked to that, and not (say) Apocalypse 2.
>>
>> Where did I suggest that you should have linked to Apocalypse 2? I wrote
>> what I wrote to point out that the "Apocalypse" title was not a
>> pejorative comment on regexes.  I don't see how I could have been
>> clearer.
>
> Possibly by saying what you just said here?
>
> I never suggested, or implied, or thought, that "Apocalypse" was a
> pejorative comment on *regexes*. The fact that I referenced Apocalypse
> FIVE suggests strongly that there are at least four others, presumably
> not about regexes.

Nor did I ever suggest you did.  Don't forget that you are
not the only person reading this list.  The comment was for
the benefit of others.  Perhaps you are being overly sensitive?

> [...]
>>> If regexes were more readable, as proposed by Wall, that would go a
>>> long way to reducing my suspicion of them.
>>
>> I am delighted to read that you find the new syntax more acceptable.
>
> Perhaps I wasn't as clear as I could have been. I don't know what the new
> syntax is. I was referring to the design principle of improving the
> readability of regexes. Whether Wall's new syntax actually does improve
> readability and ease of maintenance is a separate issue, one on which I
> don't have an opinion on. I applaud his *intention* to reform regex
> syntax, without necessarily agreeing that he has done so.

Thanks for clarifying.  But since you earlier wrote in response
to MRAB,
http://groups.google.com/group/comp.lang.python/msg/43f3a81d9cc75217?

  "Have you considered the suggested Perl 6 syntax? Much of
  it looks good to me."

I'm sure you can understand my confusion.