Taking data from a text file to parse html page

DH dylanhughes at gmail.com
Fri Aug 25 23:47:17 EDT 2006


Yes, I know how to import modules... I think I found the problem:
Linux treats upper and lower case file names as different, so for some
reason you can't import SE, but if you rename it to se it gives you an
error that it can't find SEL, and if you rename that one as well it
complains that SEL isn't defined... Are you running Linux? Have you
tested it with Linux?
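
If it really is just the file names, maybe something like this would get
the import going (pure guesswork on my part about how the archive names
its files; I'm assuming it ships SE.PY and SEL.PY with upper-case
extensions that a case-sensitive filesystem won't match against SE.py
and SEL.py):

import os

# rename only the extension, so the internal "import SEL" inside SE
# still finds its module under the expected name
for name in ("SE", "SEL"):
    if os.path.exists(name + ".PY"):
        os.rename(name + ".PY", name + ".py")

import SE    # should now resolve, provided the directory is on sys.path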

> Surely you write your own programs (program_name.py); you import and run them. You may put SE.PY and SEL.PY into the same
> directory. That's all.
>       Or if you prefer to keep other people's stuff in a different directory, just make sure that directory is in "sys.path",
> because that is where import looks. Check for that directory's presence in the sys.path list:
>
> >>> import sys
> >>> sys.path
> ['C:\\Python24\\Lib\\idlelib', 'C:\\', 'C:\\PYTHON24\\DLLs', 'C:\\PYTHON24\\lib', 'C:\\PYTHON24\\lib\\plat-win',
> 'C:\\PYTHON24\\lib\\lib-tk'     (... etc)    ]
>
> Supposing it isn't there, add it:
>
> >>> sys.path.append ('/python/code/other_peoples_stuff')
> >>> import SE
>
> That should do it. Let me know if it works. Else just keep asking.
>
> Frederic
>
>
> ----- Original Message -----
> From: "DH" <dylanhughes at gmail.com>
> Newsgroups: comp.lang.python
> To: <python-list at python.org>
> Sent: Friday, August 25, 2006 4:40 AM
> Subject: Re: Taking data from a text file to parse html page
>
>
> > SE looks very helpful... I'm having a hell of a time installing it
> > though:
> >
> > -----------------------------------------------------------------------------------------
> >
> > foo@foo:~/Desktop/SE-2.2$ sudo python SETUP.PY install
> > running install
> > running build
> > running build_py
> > file SEL.py (for module SEL) not found
> > file SE.py (for module SE) not found
> > file SEL.py (for module SEL) not found
> > file SE.py (for module SE) not found
> >
> > ------------------------------------------------------------------------------------------
> > Anthra Norell wrote:
> > > You may also want to look at this stream editor:
> > >
> > > http://cheeseshop.python.org/pypi/SE/2.2%20beta
> > >
> > > It allows multiple replacements in a definition format of utmost simplicity:
> > >
> > > >>> your_example = '''
> > > <div><p><em>"Python has been an important part of Google since the
> > > beginning, and remains so as the system grows and evolves.
> > > "</em></p>
> > > <p>-- Peter Norvig, <a class="reference"
> > > '''
> > > >>> import SE
> > > >>> Tag_Stripper = SE.SE ('''
> > >          "~<(.|\n)*?>~="   # This pattern finds all tags and deletes them (replaces with nothing)
> > >          "~<!--(.|\n)*?-->~="   # This pattern deletes comments entirely even if they nest tags
> > >          ''')
> > > >>> print Tag_Stripper (your_example)
> > >
> > > "Python has been an important part of Google since the
> > > beginning, and remains so as the system grows and evolves.
> > > "
> > > -- Peter Norvig, <a class="reference"
> > >
> > > Now you see a tag fragment. So you add another deletion to the Tag_Stripper (***):
> > >
> > > Tag_Stripper = SE.SE ('''
> > >          "~<(.|\n)*?>~="   # This pattern finds all tags and deletes them (replaces with nothing)
> > >          "~<!--(.|\n)*?-->~="   # This pattern deletes comments entirely even if they nest tags
> > >          "<a class\="reference"="    # *** This deletes the fragment
> > >          # "-- Peter Norvig, <a class\="reference"="  # Or like this if Peter Norvig has to go too
> > >        ''')
> > > >>> print Tag_Stripper (your_example)
> > >
> > > "Python has been an important part of Google since the
> > > beginning, and remains so as the system grows and evolves.
> > > "
> > > -- Peter Norvig,
> > >
> > > " you can either translate or delete:
> > >
> > > Tag_Stripper = SE.SE ('''
> > >          "~<(.|\n)*?>~="   # This pattern finds all tags and deletes them (replaces with nothing)
> > >          "~<!--(.|\n)*?-->~="   # This pattern deletes comments entirely even if they nest tags
> > >          "<a class\="reference"="    # This deletes the fragment
> > >          # "-- Peter Norvig, <a class=\\"reference\\"="  # Or like this if Peter Norvig has to go too
> > >          htm2iso.se     # This is a file (contained in the SE package) that translates all ampersand codes.
> > >                               # Naming the file is all you need to do to include the replacements which it defines.
> > >        ''')
> > >
> > > >>> print Tag_Stripper (your_example)
> > >
> > > 'Python has been an important part of Google since the
> > > beginning, and remains so as the system grows and evolves.
> > > '
> > > -- Peter Norvig,
> > >
> > > If instead of "htm2iso.se" you write ""=" you delete it and your output will be:
> > >
> > > Python has been an important part of Google since the
> > > beginning, and remains so as the system grows and evolves.
> > >
> > > -- Peter Norvig,
> > >
> > > Your Tag_Stripper also does files:
> > >
> > > >>> print Tag_Stripper ('my_file.htm', 'my_file_without_tags')
> > > 'my_file_without_tags'
> > >
> > >
> > > A stream editor is not a substitute for a parser, but it handles simple translation jobs like this one more economically, where a
> > > parser would do a lot of work you don't need.
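> > >
> > > Just for comparison, not as part of SE: the same tag deletion can be
> > > sketched with the standard re module, which also shows how small this
> > > particular job is:
> > >
> > > import re
> > >
> > > def strip_tags (html):
> > >     # drop comments first so tags nested inside them leave no
> > >     # fragments, then delete every remaining <...> tag; both
> > >     # patterns are non-greedy and run across newlines
> > >     html = re.sub (r'<!--(.|\n)*?-->', '', html)
> > >     return re.sub (r'<(.|\n)*?>', '', html)
> > >
> > > Calling print strip_tags (your_example) reproduces the first
> > > Tag_Stripper output above, tag fragment included.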
> > >
> > > Regards
> > >
> > > Frederic
> > >
> > >
> > > ----- Original Message -----
> > > From: "DH" <dylanhughes at gmail.com>
> > > Newsgroups: comp.lang.python
> > > To: <python-list at python.org>
> > > Sent: Thursday, August 24, 2006 7:41 PM
> > > Subject: Re: Taking data from a text file to parse html page
> > >
> > >
> > > > I found this
> > > >
> > > > http://groups.google.com/group/comp.lang.python/browse_thread/thread/d1bda6ebcfb060f9/ad0ac6b1ac8cff51?lnk=gst&q=replace+text+file&rnum=8#ad0ac6b1ac8cff51
> > > >
> > > > Credit Jeremy Moles
> > > > -----------------------------------------------
> > > >
> > > > finds = ("{", "}", "(", ")")
> > > > lines = file("foo.txt", "r").readlines()
> > > >
> > > > for line in lines:
> > > >         for find in finds:
> > > >                 if find in line:
> > > >                         line.replace(find, "")
> > > >
> > > > print lines
> > > >
> > > > -----------------------------------------------
> > > >
> > > > I want something like
> > > > -----------------------------------------------
> > > >
> > > > # read the keywords once, stripping the trailing newlines
> > > > finds = [f.rstrip("\n") for f in file("replace.txt")]
> > > > lines = file("foo.txt", "r").readlines()
> > > >
> > > > for i, line in enumerate(lines):
> > > >         for find in finds:
> > > >                 if find in line:
> > > >                         line = line.replace(find, "")
> > > >         lines[i] = line
> > > >
> > > > print lines
> > > >
> > > > -----------------------------------------------
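> > > >
> > > > Putting the pieces together, a rough sketch of the whole job from my
> > > > original post below (keywords read from replace.txt, stripped out of
> > > > index.html, result saved as something.txt; this assumes one literal
> > > > keyword per line in replace.txt, with blank lines ignored):
> > > >
> > > > -----------------------------------------------
> > > >
> > > > keywords = [k.rstrip("\n") for k in file("replace.txt") if k.strip()]
> > > > html = file("index.html", "r").read()
> > > >
> > > > for keyword in keywords:
> > > >         html = html.replace(keyword, "")
> > > >
> > > > out = file("something.txt", "w")
> > > > out.write(html)
> > > > out.close()
> > > >
> > > > -----------------------------------------------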
> > > >
> > > >
> > > >
> > > > Fredrik Lundh wrote:
> > > > > DH wrote:
> > > > >
> > > > > > I have a plain text file containing the html and words (keywords) that
> > > > > > I want removed from the html file; after processing, the html file
> > > > > > would be saved as a plain text file.
> > > > > >
> > > > > > So the program would import the keywords, remove them from the html
> > > > > > file and save the html file as something.txt.
> > > > > >
> > > > > > I would post the data but it's secret. I can post an example:
> > > > > >
> > > > > > index.html (html page)
> > > > > >
> > > > > > "
> > > > > > <div><p><em>"Python has been an important part of Google since the
> > > > > > beginning, and remains so as the system grows and evolves.
> > > > > > "</em></p>
> > > > > > <p>-- Peter Norvig, <a class="reference"
> > > > > > "
> > > > > >
> > > > > > replace.txt (keywords)
> > > > > > "
> > > > > > <div id="quote" class="homepage-box">
> > > > > >
> > > > > > <div><p><em>"
> > > > > >
> > > > > > "</em></p>
> > > > > >
> > > > > > <p>-- Peter Norvig, <a class="reference"
> > > > > >
> > > > > > "
> > > > > >
> > > > > > something.txt(file after editing)
> > > > > >
> > > > > > "
> > > > > >
> > > > > > Python has been an important part of Google since the beginning, and
> > > > > > remains so as the system grows and evolves.
> > > > > > "
> > > > >
> > > > > reading and writing files is described in the tutorial; see
> > > > >
> > > > >      http://pytut.infogami.com/node9.html
> > > > >
> > > > > (scroll down to "Reading and Writing Files")
> > > > >
> > > > > to do the replacement, you can use repeated calls to the "replace" method
> > > > >
> > > > >      http://pyref.infogami.com/str.replace
> > > > >
> > > > > but that may cause problems if the replacement text contains things that
> > > > > should be replaced.  for an efficient way to do a "parallel" replace, see:
> > > > >
> > > > >      http://effbot.org/zone/python-replace.htm#multiple
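> > > > >
> > > > > the idea, very roughly (the recipe at that link is more general; here
> > > > > every replacement is empty, so one combined pattern and a single
> > > > > pass are enough):
> > > > >
> > > > >      import re
> > > > >
> > > > >      def multi_delete(text, keywords):
> > > > >          # longest keywords first, so a longer fragment wins over
> > > > >          # a shorter one starting at the same position
> > > > >          keywords = sorted(keywords, key=len, reverse=True)
> > > > >          # one escaped, combined pattern; a single pass never
> > > > >          # rescans text that a replacement has already produced
> > > > >          pattern = re.compile("|".join(re.escape(k) for k in keywords))
> > > > >          return pattern.sub("", text)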
> > > > >
> > > > >
> > > > > </F>
> > > >



