[Tutor] regex problem
Michael Powe
michael at trollope.org
Wed Jan 5 20:20:49 CET 2005
On Tue, Jan 04, 2005 at 09:15:46PM -0800, Danny Yoo wrote:
>
>
> On Tue, 4 Jan 2005, Michael Powe wrote:
>
> > def parseFile(inFile) :
> >     import re
> >     bSpace = re.compile("^ ")
> >     multiSpace = re.compile(r"\s\s+")
> >     nbsp = re.compile(r"&nbsp;")
> >     HTMLRegEx = re.compile(
> >         r"(<|&lt;)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(>|&gt;)",
> >         re.I)
> >
> >     f = open(inFile,"r")
> >     lines = f.readlines()
> >     newLines = []
> >     for line in lines :
> >         line = HTMLRegEx.sub(' ',line)
> >         line = bSpace.sub('',line)
> >         line = nbsp.sub(' ',line)
> >         line = multiSpace.sub(' ',line)
> >         newLines.append(line)
> >     f.close()
> >     return newLines
> >
> > Now, the main issue I'm looking at is with the multiSpace regex. When
> > applied, this removes some blank lines but not others. I don't want it
> > to remove any blank lines, just contiguous multiple spaces in a line.
>
>
> Hi Michael,
>
> Do you have an example of a file where this bug takes place? As far as I
> can tell, since the processing is being done line-by-line, the program
> shouldn't be losing any blank lines at all.
That is what I thought. And the effect is erratic: it removes some
but not all empty lines.
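
A small sketch of how that erratic effect can arise, assuming some of the
"blank" lines in the file actually hold a stray space or tab before the
newline. The sample lines are made up for illustration and are not taken
from the real file:

```python
import re

multi_space = re.compile(r"\s\s+")  # same pattern as multiSpace above

# Hypothetical sample lines: a truly empty line, a line holding one space,
# a line with a double space inside, and a line with a trailing space.
samples = ["\n", " \n", "some  text\n", "trailing \n"]

# Apply the substitution to each line, as the loop in parseFile does.
results = [multi_space.sub(" ", line) for line in samples]
```

A truly empty line ("\n") is a single whitespace character, so the pattern
never matches it and it survives. A line with a space before the newline
gives the pattern two whitespace characters to match, so the newline is
replaced along with the space and the line runs into the next one when
written out.
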
> Do you mean that the 'multiSpace' pattern is eating the line-terminating
> newlines? If you don't want it to do this, you can modify the pattern
> slightly. '\s' is defined to be this group of characters:
>
> '[ \t\n\r\f\v]'
>
> (from http://www.python.org/doc/lib/re-syntax.html)
>
> So we can adjust our pattern from:
>
> r"\s\s+"
>
> to
>
> r"[ \t\f\v][ \t\f\v]+"
>
> so that we don't capture newlines or carriage returns. Regular
> expressions have a brace operator for dealing with repetition:
> if we're looking for two or more of some thing 'x', we can say:
I will take a look at this option. Thanks.
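
A minimal sketch of that option, assuming the brace form `{2,}` ("two or
more") is used in place of repeating the character class. The names
`multi_space` and `squeeze_spaces` are made up for illustration:

```python
import re

# Two or more spaces, tabs, form feeds or vertical tabs. Newlines and
# carriage returns are deliberately left out of the character class.
multi_space = re.compile(r"[ \t\f\v]{2,}")

def squeeze_spaces(line):
    """Collapse each run of horizontal whitespace into a single space."""
    return multi_space.sub(" ", line)
```

With this pattern the line terminator is never part of a match, so a line
such as "a   b\n" keeps its newline and whitespace-only lines stay on
their own line.
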
mp