RE Module
Anthra Norell
anthra.norell at tiscalinet.ch
Fri Aug 25 06:17:05 EDT 2006
Roman,
Your re works for me. I suspect you have tags spanning lines, which happens more often than not. If so, processing line by line
doesn't work. You need to catch the tags like this:
>>> text = re.sub ('<(.|\n)*?>', '', text)
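To illustrate the difference, here is a minimal sketch in current Python 3. The sample string is invented for the demonstration; it is not Roman's data. It compares line-by-line substitution with the pattern from the original question against substitution over the whole text:

```python
import re

# Hypothetical sample: the <a ...> tag is split across two lines.
text = 'Hello <a\nhref="x.htm">world</a>!\n'

line_tag = re.compile(r'<.*?>')       # pattern from the original question
tag = re.compile(r'<(.|\n)*?>')       # pattern suggested above

# Line by line: '.' never crosses a newline and each line is seen in
# isolation, so the opening half of the split tag survives.
linewise = ''.join(line_tag.sub('', line) for line in text.splitlines(True))

# Whole text at once: the split tag is matched and removed.
whole = tag.sub('', text)

# Equivalent spelling: let '.' match newlines via the DOTALL flag.
dotall = re.sub(r'<.*?>', '', text, flags=re.DOTALL)

print(repr(linewise))   # 'Hello <a\nhref="x.htm">world!\n'
print(repr(whole))      # 'Hello world!\n'
```

The re.DOTALL variant does the same job as '(.|\n)' and is usually easier to read.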
If your text is reasonably small I would recommend this solution. Else you might want to take a look at SE which is a stream editor
that does the buffering for you:
http://cheeseshop.python.org/pypi/SE/2.2%20beta
>>> import SE
>>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')
>>> print Tag_Stripper (text)
(... your text without tags ...)
The Tag_Stripper is made up of two regexes. The second one catches comments, which may contain tags. The first expression alone would
also catch comments, but would mistake the '>' of the first nested tag for the end of the comment and stop prematurely. The example
"re.sub ('<(.|\n)*?>', '', text)" above would fail in this respect.
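The comment problem can be reproduced with plain re. This is only a rough sketch in current Python 3 with an invented sample string. It combines the two patterns into a single alternation, which approximates the idea but is not a description of how SE applies its definitions internally:

```python
import re

# Hypothetical sample: a comment that contains a tag.
text = 'A<!-- note <b>bold</b> here -->B <i>C</i>'

tags_only = re.compile(r'<(.|\n)*?>')

# Alternatives are tried left to right, so the comment pattern comes first.
comments_then_tags = re.compile(r'<!--(.|\n)*?-->|<(.|\n)*?>')

# The tag pattern alone stops at the '>' of the nested <b>, so part of
# the comment leaks into the output.
leaky = tags_only.sub('', text)

# With the comment pattern in front, the whole comment is removed.
clean = comments_then_tags.sub('', text)

print(repr(leaky))   # 'Abold here -->B C'
print(repr(clean))   # 'AB C'
```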
Your Tag_Stripper takes input from files directly:
>>> Tag_Stripper ('name_of_file.htm', 'name_of_output_file')
'name_of_output_file'
Or if you want to view the output:
>>> Tag_Stripper ('name_of_file.htm', '')
(... your text without tags ...)
If you want to keep the definitions for later use, do this:
>>> Tag_Stripper.save ('[your_path/]tag_stripper.se')
Your definitions are now saved in the file 'tag_stripper.se'. You can edit that file. The next time you need a Tag_Stripper you can
make it simply by naming the file:
>>> Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')
You can easily expand the capabilities of your Tag_Stripper. If, for instance, you want to translate the ampersand escapes (&nbsp;
etc.) you'd simply add the name of the file that defines the ampersand replacements:
>>> Tag_Stripper = SE.SE ('tag_stripper.se htm2iso.se')
'htm2iso.se' comes with the SE package ready to use and as an example for writing one's own replacement sets.
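If SE is not at hand, the same two steps can be sketched with the standard library alone. This is a substitute technique, not what 'htm2iso.se' does internally, and it assumes current Python 3, where html.unescape is available. The sample string is invented:

```python
import html
import re

# Hypothetical sample with markup and ampersand escapes.
text = '<p>Fish &amp; chips&nbsp;&lt;today&gt;</p>'

# Strip the tags first, while '<' and '>' still only mark real tags.
stripped = re.sub(r'<(.|\n)*?>', '', text)

# Then translate the escapes; &nbsp; becomes the character '\xa0'.
plain = html.unescape(stripped)

print(repr(plain))   # 'Fish & chips\xa0<today>'
```

The order matters: unescaping first would turn '&lt;today&gt;' into something the tag pattern would then remove.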
Frederic
----- Original Message -----
From: "Simon Forman" <rogue_pedro at yahoo.com>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Friday, August 25, 2006 7:09 AM
Subject: Re: RE Module
> Roman wrote:
> > I am trying to filter a column in a list of all html tags.
>
> What?
>
> > To do that, I have setup the following statement.
> >
> > row[0] = re.sub(r'<.*?>', '', row[0])
> >
> > The results I get are sporadic. Sometimes two tags are removed.
> > Sometimes 1 tag is removed. Sometimes no tags are removed. Could
> > somebody tell me where I have gone wrong here?
> >
> > Thanks in advance
>
> I'm no re expert, so I won't try to advise you on your re, but it might
> help those who are if you gave examples of your input and output data.
> What results are you getting for what input strings?
>
> Also, if you're just trying to strip html markup to get plain text from
> a file, "w3m -dump some.html" works great. ;-)
>
> HTH,
> ~Simon
>
> --
> http://mail.python.org/mailman/listinfo/python-list