Taking data from a text file to parse html page
Anthra Norell
anthra.norell at tiscalinet.ch
Thu Aug 24 15:03:09 EDT 2006
You may also want to look at this stream editor:
http://cheeseshop.python.org/pypi/SE/2.2%20beta
It allows multiple replacements in a definition format of utmost simplicity:
>>> your_example = '''
<div><p><em>"Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
"</em></p>
<p>-- Peter Norvig, <a class="reference"
'''
>>> import SE
>>> Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
''')
>>> print Tag_Stripper (your_example)
"Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
"
-- Peter Norvig, <a class="reference"
Now you see a tag fragment. So you add another deletion to the Tag_Stripper (***):
>>> Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
"<a class\="reference"=" # *** This deletes the fragment
# "-- Peter Norvig, <a class\="reference"=" # Or like this if Peter Norvig has to go too
''')
>>> print Tag_Stripper (your_example)
"Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
"
-- Peter Norvig,
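For readers without SE installed, the same idea can be approximated with the standard re module. This is only a sketch: it assumes the text between the tildes in the definitions above behaves like an ordinary Python regular expression, and strip_tags is a name made up for the illustration, not part of SE. It is written for Python 3.

```python
import re

# One pattern with three alternatives: comments, complete tags, and the
# dangling fragment. Alternatives are tried left to right at each position,
# so the comment pattern has to come before the generic tag pattern.
TAG_OR_COMMENT = re.compile(
    r'<!--.*?-->|<[^>]*>|<a class="reference"',
    re.DOTALL,
)

def strip_tags(text):
    """Delete comments, complete tags and the known tag fragment."""
    return TAG_OR_COMMENT.sub('', text)

your_example = '''
<div><p><em>"Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
"</em></p>
<p>-- Peter Norvig, <a class="reference"
'''

print(strip_tags(your_example))
```

Unlike an SE definition, every new deletion here means editing the regular expression itself, which is where a definition list becomes more convenient.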
The &quot; you can either translate or delete:
>>> Tag_Stripper = SE.SE ('''
"~<(.|\n)*?>~=" # This pattern finds all tags and deletes them (replaces with nothing)
"~<!--(.|\n)*?-->~=" # This pattern deletes comments entirely even if they nest tags
"<a class\="reference"=" # This deletes the fragment
# "-- Peter Norvig, <a class=\\"reference\\"=" # Or like this if Peter Norvig has to go too
htm2iso.se # This is a file (contained in the SE package) that translates all ampersand codes.
# Naming the file is all you need to do to include the replacements which it defines.
''')
>>> print Tag_Stripper (your_example)
'Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
'
-- Peter Norvig,
If instead of "htm2iso.se" you write "&quot;=" you delete it and your output will be:
Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
-- Peter Norvig,
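The ampersand codes are HTML character references, and the translate-or-delete choice can also be sketched with the Python 3 standard library. This is not a reproduction of htm2iso.se, whose contents are not shown here: html.unescape decodes to Unicode characters, and the sample string is invented for the illustration.

```python
import html

snippet = '&quot;Python has been an important part of Google&quot; &amp; more'

# Translate: decode every named or numeric character reference.
translated = html.unescape(snippet)

# Delete: remove one particular reference and leave the others untouched.
deleted = snippet.replace('&quot;', '')

print(translated)
print(deleted)
```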
Your Tag_Stripper also does files:
>>> print Tag_Stripper ('my_file.htm', 'my_file_without_tags')
'my_file_without_tags'
A stream editor is not a substitute for a parser. It does, however, handle simple translation jobs like this one more
economically, because a parser does a lot of work that you don't need here.
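To make the contrast concrete, this is roughly what the parser route looks like with the standard library's html.parser module (Python 3; the module was called HTMLParser in Python 2). TextOnly and text_of are illustrative names, and the sample markup is made up.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect character data; tags and comments are simply not recorded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &quot; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def text_of(markup):
    """Return the text content of a piece of HTML."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.chunks)

sample = '<div><p><em>&quot;Python&quot;</em></p><!-- note --><p>-- Peter Norvig</p></div>'
print(text_of(sample))
```

The parser tokenizes every tag and attribute only for the result to be thrown away, which is the extra work mentioned above.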
Regards
Frederic
----- Original Message -----
From: "DH" <dylanhughes at gmail.com>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Thursday, August 24, 2006 7:41 PM
Subject: Re: Taking data from a text file to parse html page
> I found this
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/d1bda6ebcfb060f9/ad0ac6b1ac8cff51?lnk=gst&q=replace+text+file&rnum=8#ad0ac6b1ac8cff51
>
> Credit Jeremy Moles
> -----------------------------------------------
>
> finds = ("{", "}", "(", ")")
> lines = file("foo.txt", "r").readlines()
>
> for line in lines:
>     for find in finds:
>         if find in line:
>             line.replace(find, "")
>
> print lines
>
> -----------------------------------------------
>
> I want something like
> -----------------------------------------------
>
> finds = file("replace.txt")
> lines = file("foo.txt", "r").readlines()
>
> for line in lines:
>     for find in finds:
>         if find in line:
>             line.replace(find, "")
>
> print lines
>
> -----------------------------------------------
>
>
>
> Fredrik Lundh wrote:
> > DH wrote:
> >
> > > I have a plain text file containing the html and words that I want
> > > removed(keywords) from the html file, after processing the html file it
> > > would save it as a plain text file.
> > >
> > > So the program would import the keywords, remove them from the html
> > > file and save the html file as something.txt.
> > >
> > > I would post the data but it's secret. I can post an example:
> > >
> > > index.html (html page)
> > >
> > > "
> > > <div><p><em>"Python has been an important part of Google since the
> > > beginning, and remains so as the system grows and evolves.
> > > "</em></p>
> > > <p>-- Peter Norvig, <a class="reference"
> > > "
> > >
> > > replace.txt (keywords)
> > > "
> > > <div id="quote" class="homepage-box">
> > >
> > > <div><p><em>"
> > >
> > > "</em></p>
> > >
> > > <p>-- Peter Norvig, <a class="reference"
> > >
> > > "
> > >
> > > something.txt(file after editing)
> > >
> > > "
> > >
> > > Python has been an important part of Google since the beginning, and
> > > remains so as the system grows and evolves.
> > > "
> >
> > reading and writing files is described in the tutorial; see
> >
> > http://pytut.infogami.com/node9.html
> >
> > (scroll down to "Reading and Writing Files")
> >
> > to do the replacement, you can use repeated calls to the "replace" method
> >
> > http://pyref.infogami.com/str.replace
> >
> > but that may cause problems if the replacement text contains things that
> > should be replaced. for an efficient way to do a "parallel" replace, see:
> >
> > http://effbot.org/zone/python-replace.htm#multiple
> >
> >
> > </F>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
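A note on the quoted loop: str.replace returns a new string, so line.replace(find, "") on its own changes nothing, and iterating over an open file yields lines with the trailing newline still attached. Below is a minimal Python 3 sketch of the keyword-file idea with both points addressed, done in one pass as the "parallel" replace suggests. The names remove_keywords and clean_file are made up for the illustration, and trying longer keywords first is my own choice so that a short keyword cannot break up a longer one.

```python
import re

def remove_keywords(text, keywords):
    """Delete every keyword from text in a single pass."""
    # Ignore blank entries and try longer keywords first.
    keywords = sorted((k for k in keywords if k), key=len, reverse=True)
    if not keywords:
        return text
    # re.escape makes characters like ( and { match literally.
    pattern = re.compile('|'.join(re.escape(k) for k in keywords))
    return pattern.sub('', text)

def clean_file(html_path, keywords_path, out_path):
    """Read one keyword per line, strip them from html_path, save to out_path."""
    with open(keywords_path) as f:
        # strip() drops the newline (and any surrounding blanks) of each keyword.
        keywords = [line.strip() for line in f]
    with open(html_path) as f:
        text = f.read()
    with open(out_path, 'w') as f:
        f.write(remove_keywords(text, keywords))

sample = '<div><p><em>"Python"</em></p>'
print(remove_keywords(sample, ['<div><p><em>"', '"</em></p>']))
```

With the file names from the quoted post, the call would be clean_file('index.html', 'replace.txt', 'something.txt'). Because each keyword is one line of the file, a keyword cannot span a line break in this sketch.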