HTML Parser

Greg Jorgensen gregj at pobox.com
Sat Dec 30 23:57:55 EST 2000


"Voitenko, Denis" <dvoitenko at qode.com> wrote:

> I am trying to write an HTML parser.

This has been done--look at the htmllib and sgmllib modules.
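
A note for anyone reading this on Python 3: htmllib and sgmllib were removed there, and the standard library's html.parser module covers the same ground. Here is a minimal sketch of how that looks. TagCollector is a made-up name for illustration, not something from the original post:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name already lowercased.
        self.tags.append(tag)


collector = TagCollector()
collector.feed('<html><body><p class="x">Hello</p></body></html>')
collector.close()
print(collector.tags)
```

Subclassing and overriding the handle_* methods is how the module is meant to be used; handle_endtag() and handle_data() work the same way for closing tags and text.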

> I am starting off with a simple one
> like so:
>
> # html_parser.py
> import re
> import string
>
> newline=re.compile('\n')
> HTMLtags=re.compile('<.*>')

.* will match as many characters as possible, including (in your case) < and
>. You want this pattern, which will match as few characters as possible
surrounded by < and >:

    HTMLtags = re.compile('<.*?>')
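
To see the difference, here is a small sketch that runs both patterns over the same string. The sample text and variable names are mine, chosen only for illustration:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

greedy = re.compile('<.*>')   # .* grabs as much as it can
lazy = re.compile('<.*?>')    # .*? stops at the first > it can

greedy_matches = greedy.findall(text)
lazy_matches = lazy.findall(text)

print(greedy_matches)  # one match covering the whole string
print(lazy_matches)    # one match per tag
```

The greedy pattern runs from the first < to the last > in the text, so it swallows everything in between; the non-greedy one yields the four tags separately.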

You can split using a literal character instead of a regular expression:

    line = lines.split('\n')

The readlines() method of the file object will save you the trouble, but you
don't need to split the input into lines at all if you just want to find the
HTML tags.
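
A rough sketch of the three options side by side. io.StringIO stands in for a real file here, and the sample text is an assumption of mine:

```python
import io
import re

text = '<html>\n<body>\nHello\n</body>\n</html>\n'

# Splitting on a literal newline gives the same list as the regex split.
by_literal = text.split('\n')
by_regex = re.compile('\n').split(text)

# readlines() returns the lines directly, each with its newline kept.
# StringIO is only a stand-in for an open file object.
from_file = io.StringIO(text).readlines()

# No splitting at all: search the whole text for tags in one pass.
tags = re.findall('<.*?>', text)

print(by_literal)
print(from_file)
print(tags)
```

Note that split('\n') leaves a trailing empty string when the text ends with a newline, while readlines() does not.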

Here's my version:

# simple html tag processor

import sys
import re

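# re.DOTALL lets . match newlines, so a tag that spans lines is still found
# (re.MULTILINE only changes ^ and $, which this pattern does not use)
rx = re.compile('(<.*?>)', re.DOTALL)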

# HTML text will come from a file.read()
html = ('<html>\n<head>\n\t<title>Page Title</title>\n</head>\n'
        '<body>Hello, World!</body>\n</html>\n')

# split the text into tags and stuff between tags
# the re.split() creates empty list elements for adjacent matches--those can
# be ignored
# uppercase anything inside <..> and output the converted text

for s in rx.split(html):
    if s == '':            # re.split() artifact
        continue
    elif (s[0] == '<') and (s[-1] == '>'):    # <tag>
        sys.stdout.write(s.upper())
    else:          # everything else
        sys.stdout.write(s)
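
To make the re.split() artifact visible, this sketch prints the list the loop iterates over, using a short sample string of my own. It also shows re.sub() with a callback, which is an alternative I am swapping in for comparison, not what the script above does:

```python
import re

rx = re.compile('(<.*?>)')
sample = '<p><b>Hi</b> there</p>'

# Because the pattern has a capturing group, split() keeps the tags in the
# result. Adjacent tags, and tags at either end, leave '' between them.
pieces = rx.split(sample)
print(pieces)

# Alternative: let re.sub() call a function for each tag and substitute
# whatever it returns, leaving the text between tags untouched.
converted = rx.sub(lambda m: m.group(1).upper(), sample)
print(converted)
```

The sub() version avoids the empty-string check entirely, at the cost of building the whole result in memory rather than writing it piece by piece.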

--
Greg Jorgensen
PDXperts
Portland, Oregon, USA
gregj at pobox.com





