RE Module

Roman rgelfand2 at hotmail.com
Fri Aug 25 12:14:25 EDT 2006


Thanks for your help.

One thing I didn't mention: before the statement

    row[0] = re.sub(r'<.*?>', '', row[0])

I have this statement:

    row[0] = re.sub(r'[^ 0-9A-Za-z\"\'\.\,\#\@\!\(\)\*\&\%\%\\\/\:\;\?\`\~\<\>]', '', row[0])

Hence, the line separators should already be gone by the time the tags
are stripped.  You mentioned that the size of the string could be a
factor.  If so, what is the maximum size before I see problems?
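
To make the two-step filtering concrete, here is a minimal sketch in present-day Python. The sample string and the helper name clean_cell are made up for illustration, and the character class assumes a space after the '^' at the point where the statement wraps in this message.

```python
import re

# Character class from the statement above. The space after '^' is an
# assumption, because the original line wraps at exactly that point.
ALLOWED = re.compile(r'[^ 0-9A-Za-z\"\'\.\,\#\@\!\(\)\*\&\%\%\\\/\:\;\?\`\~\<\>]')
TAG = re.compile(r'<.*?>')


def clean_cell(value):
    """Drop every character outside the allowed set, then strip tags."""
    # '\n' and '\r' are not in the allowed set, so they are removed here.
    value = ALLOWED.sub('', value)
    # With the newlines gone, '.' can reach the closing '>' of every tag.
    return TAG.sub('', value)


# Hypothetical cell content with a tag that spans two lines.
sample = 'Hello <a\nhref="x.htm">world</a>!'
print(clean_cell(sample))
```

With this input the first substitution also drops the '=' inside the tag, since '=' is not in the allowed set. That does not affect the tag stripping, but it would affect any '=' in the visible text.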

Thanks again
Anthra Norell wrote:
> Roman,
>
> Your re works for me. I suspect you have tags spanning lines, a thing you get more often than not. If so, processing linewise
> doesn't work. You need to catch the tags like this:
>
> >>> text = re.sub ('<(.|\n)*?>', '', text)
>
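
A small sketch of the difference, using a made-up string with one tag that spans two lines. The re.DOTALL variant is not from the quoted reply; it is a standard flag of the re module that is equivalent here.

```python
import re

# Hypothetical input: the <b ...> tag is split across a line break.
text = 'one <b\nclass="x">two</b> three'

# '.' does not match a newline by default, so the split tag survives
# and only </b> is removed.
linewise = re.sub(r'<.*?>', '', text)

# The alternation from the quoted reply lets the match cross line breaks.
alternation = re.sub('<(.|\n)*?>', '', text)

# re.DOTALL makes '.' match newlines too, giving the same result.
dotall = re.sub(r'<.*?>', '', text, flags=re.DOTALL)

print(repr(linewise))
print(repr(alternation))
print(repr(dotall))
```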
> If your text is reasonably small I would recommend this solution. Else you might want to take a look at SE, which is a stream editor
> that does the buffering for you:
>
> http://cheeseshop.python.org/pypi/SE/2.2%20beta
>
> >>> import SE
> >>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~="  "~<!--(.|\n)*?-->~=" ')
> >>> print Tag_Stripper (text)
> (... your text without tags ...)
>
> The Tag_Stripper is made up of two regexes. The second one catches comments which may nest tags. The first expression alone would
> also catch comments, but would mistake the '>' of the first nested tag for the end of the comment and quit prematurely. The example
> "re.sub ('<(.|\n)*?>', '', text)" above would misperform in this respect.
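
The comment problem described above can be reproduced with the plain re module. This is only a sketch in two passes, not how SE works internally, and the sample text is invented.

```python
import re

# Hypothetical input: a comment that contains a nested tag.
text = 'keep <!-- hidden <b>bold</b> text --> this <i>too</i>'

TAG = re.compile(r'<.*?>', re.DOTALL)
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)

# Tag pattern alone: the match that starts at '<!--' ends at the '>' of
# the nested <b>, so the rest of the comment leaks into the output.
tags_only = TAG.sub('', text)

# Removing whole comments first, then the remaining tags, avoids that.
comments_then_tags = TAG.sub('', COMMENT.sub('', text))

print(repr(tags_only))
print(repr(comments_then_tags))
```

The order matters in this two-pass version: running the tag pattern first would already have damaged the comment delimiters before the comment pattern gets a chance to match.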
>
> Your Tag_Stripper takes input from files directly:
>
> >>> Tag_Stripper ('name_of_file.htm', 'name_of_output_file')
> 'name_of_output_file'
>
> Or if you want to view the output:
>
> >>> Tag_Stripper ('name_of_file.htm', '')
> (... your text without tags ...)
>
> If you want to keep the definitions for later use, do this:
>
> >>> Tag_Stripper.save ('[your_path/]tag_stripper.se')
>
> Your definitions are now saved in the file 'tag_stripper.se'. You can edit that file. The next time you need a Tag_Stripper you can
> make it simply by naming the file:
>
> >>> Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')
>
> You can easily expand the capabilities of your Tag_Stripper. If, for instance, you want to translate the ampersand escapes (&nbsp;
> etc.) you'd simply add the name of the file that defines the ampersand replacements:
>
> >>> Tag_Stripper = SE.SE ('tag_stripper.se  htm2iso.se')
>
> 'htm2iso.se' comes with the SE package ready to use and as an example for writing one's own replacement sets.
>
>
> Frederic
>
>
> ----- Original Message -----
> From: "Simon Forman" <rogue_pedro at yahoo.com>
> Newsgroups: comp.lang.python
> To: <python-list at python.org>
> Sent: Friday, August 25, 2006 7:09 AM
> Subject: Re: RE Module
>
>
> > Roman wrote:
> > > I am trying to filter a column in a list of all html tags.
> >
> > What?
> >
> > > To do that, I have setup the following statement.
> > >
> > > row[0] = re.sub(r'<.*?>', '', row[0])
> > >
> > > The results I get are sporadic.  Sometimes two tags are removed.
> > > Sometimes 1 tag is removed.   Sometimes no tags are removed.  Could
> > > somebody tell me where have I gone wrong here?
> > >
> > > Thanks in advance
> >
> > I'm no re expert, so I won't try to advise you on your re, but it might
> > help those who are if you gave examples of your input and output data.
> > What results are you getting for what input strings?
> >
> > Also, if you're just trying to strip html markup to get plain text from
> > a file, "w3m -dump some.html"  works great.  ;-)
> >
> > HTH,
> > ~Simon
> >



