[Tutor] regex problem

Wed Jan 5 20:20:49 CET 2005

On Tue, Jan 04, 2005 at 09:15:46PM -0800, Danny Yoo wrote:
> 
> 
> On Tue, 4 Jan 2005, Michael Powe wrote:
> 
> > def parseFile(inFile) :
> >     import re
> >     bSpace = re.compile("^ ")
> >     multiSpace = re.compile(r"\s\s+")
> >     nbsp = re.compile(r"&nbsp;")
> >     HTMLRegEx = re.compile(r"(&lt;|<)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(&gt;|>)", re.I)
> >
> >     f = open(inFile,"r")
> >     lines = f.readlines()
> >     newLines = []
> >     for line in lines :
> >         line = HTMLRegEx.sub(' ',line)
> >         line = bSpace.sub('',line)
> >         line = nbsp.sub(' ',line)
> >         line = multiSpace.sub(' ',line)
> >         newLines.append(line)
> >     f.close()
> >     return newLines
> >
> > Now, the main issue I'm looking at is with the multiSpace regex.  When
> > applied, this removes some blank lines but not others.  I don't want it
> > to remove any blank lines, just contiguous multiple spaces in a line.
> 
> 
> Hi Michael,
> 
> Do you have an example of a file where this bug takes place?  As far as I
> can tell, since the processing is being done line-by-line, the program
> shouldn't be losing any blank lines at all.

That is what I thought.  And the effect is erratic: it removes some
but not all empty lines.
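
One plausible explanation for the erratic behaviour, sketched below: a
truly empty line is just "\n", a single whitespace character, so
r"\s\s+" leaves it alone.  A "blank" line that also holds a space or a
carriage return has two or more whitespace characters, so the whole
line, newline included, collapses to one space.  The sample lines are
made up for illustration, and the "\r\n" case only applies when the
file is read without newline translation.

```python
import re

multi_space = re.compile(r"\s\s+")

# Hypothetical sample lines, shaped like what readlines() returns.
lines = ["first\n", "\n", "second\n", " \n", "third\n", "\r\n", "fourth  \n"]

# Apply the same substitution the original loop performs on each line.
cleaned = [multi_space.sub(" ", line) for line in lines]

# "\n" survives (one whitespace character), while " \n", "\r\n" and the
# trailing "  \n" each lose their newline and become a single space.
print(repr("".join(cleaned)))
```

With this input, the empty line after "first" is kept, but the lines
after "second" and "third" vanish and the following text is pulled up
onto the same line.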

> Do you mean that the 'multiSpace' pattern is eating the line-terminating
> newlines?  If you don't want it to do this, you can modify the pattern
> slightly.  '\s' is defined to be this group of characters:
> 
>     '[ \t\n\r\f\v]'
> 
> (from http://www.python.org/doc/lib/re-syntax.html)
> 
> So we can adjust our pattern from:
> 
>     r"\s\s+"
> 
> to
> 
>     r"[ \t\f\v][ \t\f\v]+"
> 
> so that we don't capture newlines or carriage returns.  Regular
> expressions have a brace operator for dealing with repetition: if
> we're looking for at least 2 of some thing 'x', we can say:

I will take a look at this option.  Thanks.
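
A minimal sketch of how I read that suggestion, assuming the brace form
meant is the standard {2,} quantifier ("two or more").  The helper name
squeeze_spaces is my own, not something from the original program.

```python
import re

# Horizontal whitespace only: \n and \r are deliberately left out of
# the character class so line terminators are never consumed.
multi_space = re.compile(r"[ \t\f\v]{2,}")

def squeeze_spaces(line):
    """Collapse runs of two or more spaces/tabs into a single space."""
    return multi_space.sub(" ", line)

print(repr(squeeze_spaces("a   b \t c  \n")))
print(repr(squeeze_spaces(" \n")))
```

Since the class excludes "\n" and "\r", a line holding only a space
and a newline is returned unchanged, so blank lines stay in place.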

mp