replacing words in HTML file

Cameron Simpson cs at zip.com.au
Thu Apr 29 18:47:27 EDT 2010


On 29Apr2010 05:03, james_027 <cai.haibin at gmail.com> wrote:
| On Apr 29, 5:31 am, Cameron Simpson <c... at zip.com.au> wrote:
| > On 28Apr2010 22:03, Daniel Fetchinson <fetchin... at googlemail.com> wrote:
| > | > Any idea how I can replace words in an HTML file? Meaning only the
| > | > content gets replaced while the HTML tags, JavaScript, and CSS
| > | > remain untouched.
[...]
| > The only way to get this right is to parse the file, then walk the doc
| > tree editing only the text parts.
| >
| > The BeautifulSoup module (3rd party, but a single .py file and trivial to
| > fetch and use, though it has some dependencies) does a good job of this,
| > coping even with typical not-quite-right HTML. It gives you a parse
| > tree you can easily walk, and you can modify it in place and write it
| > straight back out.
| 
| Thanks for all your input. Cameron Simpson gets the idea of what I am
| trying to do. I've been looking at BeautifulSoup, but so far I don't know
| how to perform search and replace within it.

Well, the BeautifulSoup documentation page helped me:
  http://www.crummy.com/software/BeautifulSoup/documentation.html

Here's a function from a script I wrote to bulk edit a web site. I was
replacing OBJECT and EMBED nodes with modern versions:

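  import os
  import sys
  from BeautifulSoup import BeautifulSoup, Tag   # BeautifulSoup 3, Python 2

  def recurse(node):
    # Walk the parse tree below node, fixing OBJECT and EMBED tags in place.
    # SRCPATH, pathwrt(), fixmsobj() and fixembed() come from the rest of
    # the script and are not shown here.
    global didmod
    for O in node.contents:
      if isinstance(O, Tag):
        # Warn about links to local files that do not exist.
        for attr in 'src', 'href':
          # O.get() looks up an attribute; "attr in O" would test O's
          # child nodes instead.
          rurl = O.get(attr)
          if rurl is not None:
            rurlpath = pathwrt(rurl, SRCPATH)
            if not os.path.exists(rurlpath):
              print >>sys.stderr, "%s: MISSING: %s" % (SRCPATH, rurlpath)
        O2 = None
        if O.name == "object":
          O2, SUBOBJ = fixmsobj(O)
        elif O.name == "embed":
          O2, SUBOBJ = fixembed(O)
        if O2 is not None:
          # Swap O2 in where O was, then put O where SUBOBJ was.
          O.replaceWith(O2)
          SUBOBJ.replaceWith(O)
          ##print >>sys.stderr, "%s: update: new OBJECT: %s" % (SRCPATH, str(O2), )
          didmod = True
          continue
        recurse(O)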

but you only have to change it a little to modify things that aren't Tag
objects, i.e. the text nodes that sit in node.contents alongside the Tags.
The calling end looks like this:

  with open(SRCPATH) as srcfp:
    srctext = srcfp.read()
  SOUP = BeautifulSoup(srctext)
  didmod = False        # icky global set by recurse()
  recurse(SOUP)
  if didmod:
    srctext = str(SOUP)

If didmod becomes True we recompute srctext and resave the file (or save it
to a copy).
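
For the original question (touching only the text, never the tags, scripts
or CSS), here is a rough sketch of the same walk-and-edit idea. To keep it
free of third-party modules it uses the standard library's html.parser in
Python 3 rather than BeautifulSoup, so it is not the script above;
WordReplacer and replace_words are names made up for this example. It
passes markup through as found and rewrites only text outside SCRIPT and
STYLE:

```python
import re
from html.parser import HTMLParser


class WordReplacer(HTMLParser):
    """Copy HTML through, rewriting whole words in visible text only."""

    SKIP = ("script", "style")  # text inside these elements is left alone

    def __init__(self, old, new):
        # convert_charrefs=False: entities such as &amp; arrive via
        # handle_entityref()/handle_charref() and are re-emitted as-is.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.pattern = re.compile(r"\b%s\b" % re.escape(old))
        self.new = new
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        # Re-emit the tag exactly as written, attributes untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # The only place where any replacement happens.
        if self.skip_depth == 0:
            data = self.pattern.sub(self.new, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def replace_words(html, old, new):
    """Return html with the whole word old replaced by new in text nodes."""
    parser = WordReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<p class="cat">The cat sat.</p><script>var cat = 1;</script>'
    print(replace_words(page, "cat", "dog"))
```

Rough edges: end tags come back lowercased, and marked sections such as
CDATA are not preserved faithfully, so treat it as a starting point. With
BeautifulSoup the equivalent change to recurse() would be to handle the
children of node.contents that are not Tag instances.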

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Democracy is the theory that the people know what they want, and deserve to
get it good and hard.   - H.L. Mencken


