HTML parsing confusion

Tue Jan 22 09:31:29 EST 2008

On Jan 22, 7:44 am, Alnilam <alni... at gmail.com> wrote:
> ...I move from computer to
> computer regularly, and while all have a recent copy of Python, each
> has different (or no) extra modules, and I don't always have the
> luxury of downloading extras. That being said, if there's a simple way
> of doing it with BeautifulSoup, please show me an example. Maybe I can
> figure out a way to carry the extra modules I need around with me.

Pyparsing's footprint is intentionally small - just one pyparsing.py
file that you can drop into a directory next to your own script.  And
the code to extract paragraph 5 of the "Dive Into Python" home page?
See annotated code below.

-- Paul

from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, anyCloseTag
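import textwrap

# urlopen lives in urllib.request under Python 3; fall back to
# the Python 2 location if that import fails
try:
    from urllib.request import urlopen
except ImportError:
    from urllib import urlopen

page = urlopen("http://diveintopython.org/")
# read() returns bytes under Python 3, so decode before parsing
# (UTF-8 is assumed here; undecodable bytes are replaced)
source = page.read().decode("utf-8", "replace")
page.close()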

# define a simple paragraph matcher
pStart,pEnd = makeHTMLTags("P")
paragraph = pStart.suppress() + SkipTo(pEnd) + pEnd.suppress()

# get all paragraphs from the input string (or use the
# scanString generator function to stop at the correct
# paragraph instead of reading them all)
paragraphs = paragraph.searchString(source)

# create a transformer that will strip HTML tags
tagStripper = anyOpenTag.suppress() | anyCloseTag.suppress()

# get paragraphs[5] (zero-based index, so the sixth match)
# and strip the HTML tags
p5TextOnly = tagStripper.transformString(paragraphs[5][0])

# remove extra whitespace
p5TextOnly = " ".join(p5TextOnly.split())

# print out a nicely wrapped string - so few people know
# that textwrap is part of the standard Python distribution,
# but it is very handy
print(textwrap.fill(p5TextOnly, 60))
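
The last two steps can be tried on their own, without pyparsing or a network connection. What follows is only a minimal sketch: the sample string is made up to stand in for a tag-stripped paragraph, and the width of 30 is chosen just to force a few line breaks.

```python
import textwrap

# Hypothetical sample text standing in for a tag-stripped paragraph.
raw = "Dive Into   Python is a\n  free Python book\tfor experienced programmers."

# split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so re-joining with a single space
# collapses every run down to one space.
normalized = " ".join(raw.split())

# fill() re-wraps the text at word boundaries so that no line
# is longer than the given width.
wrapped = textwrap.fill(normalized, 30)
print(wrapped)
```

textwrap.wrap() does the same job but returns a list of lines instead of one newline-joined string, which can be handier if the lines need further processing.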