RE Module

Fri Aug 25 17:10:26 EDT 2006

Roman,

I don't quite understand what you mean. Line separators gone? That would be the '\n', right? What of it if you process line by line,
as your variable name 'row' suggests?
      As to the maximum size re can handle, I have no idea. I vaguely remember the topic being discussed. You should be able to find
the discussions in the archives, if a knowledgeable soul doesn't volunteer the info right away. With SE it is of no concern.

Anyway, I think the best thing to do is to just try with a real page:

>>> import urllib
>>> f = urllib.urlopen (r'http://www.python.org')
>>> page = f.read (); f.close ()
>>> import SE
>>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~="  "~<!--(.|\n)*?-->~=" ')
>>> Tag_Stripper (page)
( ... page without tags, but lots of empty lines ...)
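For comparison, roughly the same two definitions can be written with the standard re module alone. This is only a minimal sketch, assuming Python 3 and a short made-up string instead of a real page; 'strip_markup' and 'sample' are names invented here for illustration, and the single alternation merely approximates what the two SE definitions do together, without any of SE's buffering.

```python
import re

# The comment pattern is listed first, so a '>' inside a comment
# does not end the match early.
# re.DOTALL lets '.' cross line breaks, standing in for the (.|\n) idiom.
MARKUP = re.compile(r'<!--.*?-->|<.*?>', re.DOTALL)

def strip_markup(text):
    """Delete HTML comments and tags, including ones that span lines."""
    return MARKUP.sub('', text)

# Made-up sample: one tag broken across two lines, one comment nesting a tag.
sample = '<a\n href="x">link</a> <!-- a <b>nested</b> tag -->done'
print(strip_markup(sample))
```

With only the '<.*?>' alternative, the same comment would leave 'nested tag -->' behind, which is why the comment pattern is there.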

If you want to take the empty lines out, do this:

>>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~="  "~<!--(.|\n)*?-->~="  |  "~\r?\n\s+?(?=\r?\n)~="  |  "~(\r?\n)+~=\n" ')

"|" means do the preceding replacements (which happen to be deletions: replace with nothing) and go on from there. The expressions
we added say: delete lines that contain only spaces. Do that (another "|"). And finally replace multiple consecutive line feeds with
a single line feed.
      So you can develop interactively. Add a definition. See what it does. Add another one. One little step at a time. Hacking at
its best!
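The two clean-up steps can also be tried on their own with plain re, one substitution per step, which makes it easy to see what each one contributes. Again just a sketch, assuming Python 3; 'squeeze_lines' and the test string are made up for the illustration, and the patterns are the ones from the definitions above.

```python
import re

def squeeze_lines(text):
    # Step 1: delete whitespace that sits between two line breaks,
    # i.e. lines that contain nothing but spaces.
    text = re.sub(r'\r?\n\s+?(?=\r?\n)', '', text)
    # Step 2: collapse each remaining run of line breaks into a single '\n'.
    return re.sub(r'(?:\r?\n)+', '\n', text)

# Made-up input resembling a page after its tags have been removed.
stripped = 'one\n   \n\n\ntwo\r\n\r\nthree'
print(repr(squeeze_lines(stripped)))
```

Running the steps separately like this mirrors the "|" in the SE definition: each substitution works on the output of the previous one.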

Frederic

----- Original Message -----
From: "Roman" <rgelfand2 at hotmail.com>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Friday, August 25, 2006 6:14 PM
Subject: Re: RE Module

> Thanks for your help.
>
> A thing I didn't mention is that before the statement row[0] =
> re.sub(r'<.*?>', '', row[0]), I have row[0]=re.sub(r'[^
> 0-9A-Za-z\"\'\.\,\#\@\!\(\)\*\&\%\%\\\/\:\;\?\`\~\<\>]', '', row[0])
> statement.  Hence, the line separators are going to be gone.  You
> mentioned the size of the string could be a factor.  If so what is the
> max size before I see problems?
>
> Thanks again
> Anthra Norell wrote:
> > Roman,
> >
> > Your re works for me. I suspect you have tags spanning lines, a thing you get more often than not. If so, processing linewise
> > doesn't work. You need to catch the tags like this:
> >
> > >>> text = re.sub ('<(.|\n)*?>', '', text)
> >
> > If your text is reasonably small I would recommend this solution. Else you might want to take a look at SE which is a stream editor
> > that does the buffering for you:
> >
> > http://cheeseshop.python.org/pypi/SE/2.2%20beta
> >
> > >>> import SE
> > >>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~="  "~<!--(.|\n)*?-->~=" ')
> > >>> print Tag_Stripper (text)
> > (... your text without tags ...)
> >
> > The Tag_Stripper is made up of two regexes. The second one catches comments which may nest tags. The first expression alone would
> > also catch comments, but would mistake the '>' of the first nested tag for the end of the comment and quit prematurely. The example
> > "re.sub ('<(.|\n)*?>', '', text)" above would misperform in this respect.
> >
> > Your Tag_Stripper takes input from files directly:
> >
> > >>> Tag_Stripper ('name_of_file.htm', 'name_of_output_file')
> > 'name_of_output_file'
> >
> > Or if you want to view the output:
> >
> > >>> Tag_Stripper ('name_of_file.htm', '')
> > (... your text without tags ...)
> >
> > If you want to keep the definitions for later use, do this:
> >
> > >>> Tag_Stripper.save ('[your_path/]tag_stripper.se')
> >
> > Your definitions are now saved in the file 'tag_stripper.se'. You can edit that file. The next time you need a Tag_Stripper you can
> > make it simply by naming the file:
> >
> > >>> Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')
> >
> > You can easily expand the capabilities of your Tag_Stripper. If, for instance, you want to translate the ampersand escapes (&nbsp;
> > etc.) you'd simply add the name of the file that defines the ampersand replacements:
> >
> > >>> Tag_Stripper = SE.SE ('tag_stripper.se  htm2iso.se')
> >
> > 'htm2iso.se' comes with the SE package ready to use and as an example for writing one's own replacement sets.
> >
> >
> > Frederic
> >
> >
> > ----- Original Message -----
> > From: "Simon Forman" <rogue_pedro at yahoo.com>
> > Newsgroups: comp.lang.python
> > To: <python-list at python.org>
> > Sent: Friday, August 25, 2006 7:09 AM
> > Subject: Re: RE Module
> >
> >
> > > Roman wrote:
> > > > I am trying to filter a column in a list of all html tags.
> > >
> > > What?
> > >
> > > > To do that, I have setup the following statement.
> > > >
> > > > row[0] = re.sub(r'<.*?>', '', row[0])
> > > >
> > > > The results I get are sporadic.  Sometimes two tags are removed.
> > > > Sometimes 1 tag is removed.   Sometimes no tags are removed.  Could
> > > > somebody tell me where I have gone wrong here?
> > > >
> > > > Thanks in advance
> > >
> > > I'm no re expert, so I won't try to advise you on your re, but it might
> > > help those who are if you gave examples of your input and output data.
> > > What results are you getting for what input strings?
> > >
> > > Also, if you're just trying to strip html markup to get plain text from
> > > a file, "w3m -dump some.html"  works great.  ;-)
> > >
> > > HTH,
> > > ~Simon
> > >
> > > --
> > > http://mail.python.org/mailman/listinfo/python-list