How use XML parsing tools on this one specific URL?

Paul McGuire ptmcg at austin.rr.com
Sun Mar 4 21:25:09 EST 2007


On Mar 4, 11:42 am, "seber... at spawar.navy.mil"
<seber... at spawar.navy.mil> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on.  Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
>
> Chris

How about a pyparsing hack instead?  With English-readable expression
names and a few comments, I think this is fairly easy to follow.  Also
note the sample statement at the end showing how to use the results
names to access the individual data fields (much easier than indexing
into a 20-element list!).

(You should also verify you are not running afoul of any terms of
service related to the content of this page.)

-- Paul

=======================
from pyparsing import *
import urllib

# define matching elements
integer = Word(nums).setParseAction(lambda t:int(t[0]))
real = Combine(Word(nums) + Word(".",nums)).setParseAction(lambda t:float(t[0]))
pct = real + Suppress("%")
date = Combine(Word(nums) + '/' + Word(nums))
tdStart,tdEnd = map(Suppress,makeHTMLTags("td"))
dollarUnits = oneOf("Mil Bil")

# stats are one of two patterns - single value or double value stat,
# wrapped in HTML <td> tags
# also, attach parse action to make sure each matches only once
def statPattern(name,label,statExpr=real):
    if (isinstance(statExpr,And)):
        statExpr.exprs[0] = statExpr.exprs[0].setResultsName(name)
    else:
        statExpr = statExpr.setResultsName(name)
    expr = tdStart + Suppress(label) + tdEnd + tdStart + statExpr + tdEnd
    return expr.setParseAction(OnlyOnce(lambda t:None))

def bistatPattern(name,label,statExpr1=real,statExpr2=real):
    expr = (tdStart + Suppress(label) + tdEnd +
            tdStart + statExpr1 + tdEnd +
            tdStart + statExpr2 + tdEnd).setResultsName(name)
    return expr.setParseAction(OnlyOnce(lambda t:None))

stats = [
    statPattern("last","Last Price"),
    statPattern("hi","52 Week High"),
    statPattern("lo","52 Week Low"),
    statPattern("vol","Volume", real + Suppress(dollarUnits)),
    statPattern("aveDailyVol_13wk","Average Daily Volume (13wk)",
                real + Suppress(dollarUnits)),
    statPattern("movingAve_50day","50 Day Moving Average"),
    statPattern("movingAve_200day","200 Day Moving Average"),
    statPattern("volatility","Volatility (beta)"),
    bistatPattern("relStrength_last3","Last 3 Months", pct, integer),
    bistatPattern("relStrength_last6","Last 6 Months", pct, integer),
    bistatPattern("relStrength_last12","Last 12 Months", pct, integer),
    bistatPattern("sales","Sales", real+Suppress(dollarUnits), pct),
    bistatPattern("income","Income", real+Suppress(dollarUnits), pct),
    bistatPattern("divRate","Dividend Rate", real, pct | "NA"),
    bistatPattern("divYield","Dividend Yield", pct, pct),
    statPattern("curQtrEPSest","Qtr("+date+") EPS Estimate"),
    statPattern("curFyEPSest","FY("+date+") EPS Estimate"),
    statPattern("curPE","Current P/E"),
    statPattern("fwdEPSest","FY("+date+") EPS Estimate"),
    statPattern("fwdPE","Forward P/E"),
    ]

# create overall search pattern - things move faster if we verify that
# we are positioned at a <td> tag before going through the MatchFirst group
statSearchPattern = FollowedBy(tdStart) + MatchFirst(stats)

# SETUP IS DONE - now get the HTML source
# read in web page
pg = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
stockHTML = pg.read()
pg.close()

# extract and merge statistics
ticker = sum( statSearchPattern.searchString(stockHTML),ParseResults([]) )

# print them out
print ticker.dump()
print ticker.last, ticker.hi,ticker.lo,ticker.vol,ticker.volatility

-----------------------
prints:
[39.549999999999997, 43.32, 30.920000000000002, 2.3599999999999999,
2.7400000000000002, 40.920000000000002, 37.659999999999997,
0.72999999999999998, 1.5, 55, 15.5, 69, 9.8000000000000007, 62,
6.2999999999999998, 19.399999999999999, 586.29999999999995,
27.199999999999999, 0.0, 'NA', 0.0, 0.0, 0.78000000000000003,
2.1499999999999999, 19.399999999999999, 2.3900000000000001,
18.399999999999999]
- aveDailyVol_13wk: 2.74
- curFyEPSest: 2.15
- curPE: 19.4
- curQtrEPSest: 0.78
- divRate: [0.0, 'NA']
- divYield: [0.0, 0.0]
- fwdEPSest: 2.39
- fwdPE: 18.4
- hi: 43.32
- income: [586.29999999999995, 27.199999999999999]
- last: 39.55
- lo: 30.92
- movingAve_200day: 37.66
- movingAve_50day: 40.92
- relStrength_last12: [9.8000000000000007, 62]
- relStrength_last3: [1.5, 55]
- relStrength_last6: [15.5, 69]
- sales: [6.2999999999999998, 19.399999999999999]
- vol: 2.36
- volatility: 0.73
39.55 43.32 30.92 2.36 0.73
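
A note on the long decimals in the list above: they are the same values as the short ones in the named entries, just formatted with more digits. In the Python of this era, repr() of a float used 17 significant digits and str() used 12, and printing a list shows the repr() of each element. The snippet below is only an illustration of that difference using explicit format strings (the variable names are made up for the example); it does not depend on pyparsing.

```python
# 39.55 has no exact binary representation, so the stored double is
# slightly below it. How many digits you ask for decides what you see.
value = 39.55

full = "%.17g" % value    # 17 significant digits, like the old repr()
short = "%.12g" % value   # 12 significant digits, like the old str()

print(full)    # 39.549999999999997
print(short)   # 39.55
```

So nothing is lost in the parse; if the long form is a nuisance, format the values explicitly when printing.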

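If you want to try the idea without fetching the page, here is a cut-down sketch that runs against an inline HTML string. It assumes pyparsing is installed and is written in Python 3 syntax; SAMPLE_HTML is a made-up stand-in, not the real MSN markup, and stat_pattern is a simplified version of statPattern above that leaves out the OnlyOnce guard and the two-value stats.

```python
from pyparsing import (Combine, MatchFirst, ParseResults, Suppress, Word,
                       makeHTMLTags, nums)

# Hypothetical sample data standing in for the downloaded page.
SAMPLE_HTML = """
<table>
<tr><td>Last Price</td><td>39.55</td></tr>
<tr><td>52 Week High</td><td>43.32</td></tr>
</table>
"""

# A real number, converted to float at parse time.
real = Combine(Word(nums) + "." + Word(nums)).setParseAction(
    lambda t: float(t[0]))

# Opening and closing <td> tags, matched but dropped from the results.
td_start, td_end = map(Suppress, makeHTMLTags("td"))

def stat_pattern(name, label):
    """Match <td>label</td><td>number</td> and name the number."""
    return (td_start + Suppress(label) + td_end +
            td_start + real.setResultsName(name) + td_end)

stats = MatchFirst([
    stat_pattern("last", "Last Price"),
    stat_pattern("hi", "52 Week High"),
])

# searchString returns one ParseResults per match; summing them onto an
# empty ParseResults merges the tokens and the results names.
ticker = sum(stats.searchString(SAMPLE_HTML), ParseResults([]))

print(ticker.last, ticker.hi)
```

The sum() call is the same merge step used in the full script: each match carries its own results name, and adding them together gives a single object with every name available as an attribute.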


