Help Parsing XML Namespaces with BeautifulSoup

Paul McGuire ptmcg at austin.rr.com
Sun Feb 18 02:08:52 EST 2007


On Feb 17, 6:55 pm, "snewma... at gmail.com" <snewma... at gmail.com> wrote:
> I'm trying to parse out some XML nodes with namespaces using
> BeautifulSoup. I can't seem to get the syntax correct. It doesn't like
> the colon in the tag name, and I'm not sure how to refer to that tag.
>
> I'm trying to get the attributes of this tag:
>
> <yweather:forecast day="Sun" date="18 Feb 2007" low="39" high="55"
> text="Partly Cloudy/Wind" code="24">
>
> The only way I've been able to get it is by doing a findAll with
> regex. Is there a better way?
>
> ----------
>
> from BeautifulSoup import BeautifulStoneSoup
> import urllib2
>
> url = 'http://weather.yahooapis.com/forecastrss?p=33609'
> page = urllib2.urlopen(url)
> soup = BeautifulStoneSoup(page)
>
> print soup['yweather:forecast']
>
> ----------

If you only need to extract one particular tag, pyparsing can do
this readily, and the results it returns make it easy to pick out
the tag's attribute values.

-- Paul


from pyparsing import makeHTMLTags
import urllib2

url = 'http://weather.yahooapis.com/forecastrss?p=78732'
page = urllib2.urlopen(url)
html = page.read()
page.close()

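# makeHTMLTags returns (openTag, closeTag) expressions; only the
# opening tag is needed here, since it carries the attributes
forecastTag = makeHTMLTags('yweather:forecast')[0]

# searchString scans the whole text and yields one result per match;
# attributes are available by name, so fc works as a format mapping
for fc in forecastTag.searchString(html):
    print fc.asList()
    print "Date: %(date)s,  hi:%(high)s lo:%(low)s" % fc
    print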

Prints:

['yweather:forecast', ['day', 'Sat'], ['date', '17 Feb 2007'], ['low',
'34'], ['high', '67'], ['text', 'Clear'], ['code', '31'], True]
Date: 17 Feb 2007,  hi:67 lo:34

['yweather:forecast', ['day', 'Sun'], ['date', '18 Feb 2007'], ['low',
'42'], ['high', '65'], ['text', 'Sunny'], ['code', '32'], True]
Date: 18 Feb 2007,  hi:65 lo:42
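
For comparison, here is a minimal sketch of a different technique from
the pyparsing approach above: looking the tag up by namespace with the
standard library's xml.etree.ElementTree. It is written for Python 3
and runs against an inline sample instead of the live feed. The sample
document, the namespace URI, and the helper name forecasts() are all
assumptions made for illustration; with a real feed, the URI in NS
would have to match the xmlns:yweather declaration in that feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the forecast tags in this thread.
# The namespace URI is an assumption, not copied from the real feed.
SAMPLE = """<rss version="2.0" xmlns:yweather="http://example.com/ns/yweather">
  <channel>
    <item>
      <yweather:forecast day="Sat" date="17 Feb 2007" low="34" high="67" text="Clear" code="31" />
      <yweather:forecast day="Sun" date="18 Feb 2007" low="42" high="65" text="Sunny" code="32" />
    </item>
  </channel>
</rss>"""

# Map the prefix used in the query to the URI declared in the document.
NS = {"yweather": "http://example.com/ns/yweather"}


def forecasts(xml_text):
    """Return the attribute dict of every yweather:forecast element."""
    root = ET.fromstring(xml_text)
    return [dict(el.attrib) for el in root.iterfind(".//yweather:forecast", NS)]


for fc in forecasts(SAMPLE):
    print("Date: %(date)s,  hi:%(high)s lo:%(low)s" % fc)
```

Unlike the pyparsing version, this requires the input to be
well-formed XML, and the prefix in the query is resolved through the
NS mapping, not matched as literal text.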
