[IronPython] speed

Thu Jul 27 08:15:19 CEST 2006

An issue about performance when using BeautifulSoup (a forgiving HTML
parser) was raised on my behalf on CodePlex:
http://www.codeplex.com/WorkItem/View.aspx?ProjectName=IronPython&WorkItemId=651

Here's a simple test which does the parsing and the "prettifying" - the
process where BeautifulSoup rewrites the HTML in an attempt to make it
well-formed.

The benchmark takes the HTML of 2 different urls, loads each into
BeautifulSoup and then reads out the "pretty" (or better-formed) html. I
can't use urllib under IronPython because it's apparently not been
implemented (see
http://www.codeplex.com/WorkItem/View.aspx?ProjectName=IronPython&WorkItemId=1368),
so I've split the work into two scripts.

makeFiles.py, which should be run in CPython, reads the urls and
writes them to files.
test.py is the actual benchmark; it reads those files and times the
loading and prettifying.
The code for both is at the end of this message.

These are the results I'm getting:

CPython 2.4
------------------
ran test_getHtml in     0.00 seconds
ran test_load in        0.28 seconds
ran test_prettify in    0.05 seconds
ran benchmark in        0.33 seconds

IronPython 1.0 RC1
----------------------------
ran test_getHtml in     0.04 seconds
ran test_load in        2.49 seconds
ran test_prettify in    0.24 seconds
ran benchmark in        2.77 seconds

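So you can see that IronPython is roughly 8 times slower than CPython
overall (2.77 vs 0.33 seconds), with nearly all of the time spent in
test_load, i.e. BeautifulSoup parsing (2.49 vs 0.28 seconds).
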
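To put numbers on the difference, here is a small sketch that works out the per-step slowdown from the timings listed above. The dictionary and function names are just illustrative, and the getHtml step is skipped because the CPython time rounds to 0.00.

```python
# Timings in seconds, copied from the results listed above.
cpython = {"test_getHtml": 0.00, "test_load": 0.28,
           "test_prettify": 0.05, "benchmark": 0.33}
ironpython = {"test_getHtml": 0.04, "test_load": 2.49,
              "test_prettify": 0.24, "benchmark": 2.77}

def slowdown(name):
    """Return IronPython time / CPython time, or None if the CPython time is 0."""
    base = cpython[name]
    if base == 0:
        return None
    return ironpython[name] / base

for name in ["test_getHtml", "test_load", "test_prettify", "benchmark"]:
    ratio = slowdown(name)
    if ratio is None:
        print("%-14s n/a (CPython time rounds to 0.00)" % name)
    else:
        print("%-14s %.1fx slower" % (name, ratio))
```

With these figures test_load comes out at about 8.9x, test_prettify at about 4.8x and the whole benchmark at about 8.4x.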
#---- makeFiles.py
import urllib

def test_getHtml(url):
    f = urllib.urlopen(url)
    html = f.read()
    f.close()

    return html

def saveFile(fName, data):
    f = open(fName, "w")
    f.write(data)
    f.close()
    return

urls = ["http://news.bbc.co.uk/2/hi/middle_east/5213602.stm",
        "http://www.cnn.com/2006/US/07/25/highway.shootings.ap/index.html"]
files = ["c:\\bbc.html", "c:\\cnn.html"]

# Fetch each url and save it to the matching file.
for url, fName in zip(urls, files):
    data = test_getHtml(url)
    saveFile(fName, data)

#test.py
#-----------------------------------------------------------------------------------------------------------------
#| Code Start
#-----------------------------------------------------------------------------------------------------------------
import sys
sys.path.append("C:\\Python24\\Lib")

from BeautifulSoup import BeautifulSoup
import time

def test_getFile(fileName):
    f = open(fileName, "r")
    html = f.read()
    f.close()

    return html

def test_load(html):
    s = BeautifulSoup(html)
    return s

def test_prettify(s):
    t = s.prettify()
    return t

files = ["c:\\bbc.html", "c:\\cnn.html"]
testCount = 2

benchmarkStart = time.clock()
time_getHtml = 0
time_load = 0
time_prettify = 0

for i in range(testCount):

    # Each pass processes every file in turn.
    for fName in files:
        testStart = time.clock()
        html = test_getFile(fName)
        testEnd = time.clock()
        time_getHtml += testEnd - testStart

        testStart = time.clock()
        s = test_load(html)
        testEnd = time.clock()
        time_load += testEnd - testStart

        testStart = time.clock()
        t = test_prettify(s)
        testEnd = time.clock()
        time_prettify += testEnd - testStart

benchmarkEnd = time.clock()

print 'ran test_getHtml in \t%.2f seconds' % (time_getHtml)
print 'ran test_load in \t%.2f seconds' % (time_load)
print 'ran test_prettify in \t%.2f seconds' % (time_prettify)

print 'ran benchmark in \t%.2f seconds' % (benchmarkEnd - benchmarkStart)
#-----------------------------------------------------------------------------------------------------------------
#| Code End
#-----------------------------------------------------------------------------------------------------------------
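
The start/stop/accumulate pattern repeated in test.py could also be folded into one helper. This is only a sketch: timed, totals and fake_load are made-up names, fake_load stands in for the BeautifulSoup call so the snippet runs on its own, and the clock falls back to time.clock only where time.perf_counter doesn't exist.

```python
import time

# Use perf_counter where available; fall back to time.clock on
# older interpreters such as CPython 2.4.
_clock = getattr(time, "perf_counter", None) or time.clock

totals = {}

def timed(label, func, *args):
    """Call func(*args), add the elapsed time to totals[label], return the result."""
    start = _clock()
    result = func(*args)
    totals[label] = totals.get(label, 0.0) + (_clock() - start)
    return result

# Stand-in workload (hypothetical) in place of BeautifulSoup(html).
def fake_load(html):
    return html.upper()

for i in range(2):
    timed("test_load", fake_load, "<p>hello</p>")

print("ran test_load in \t%.2f seconds" % totals["test_load"])
```

In test.py the calls would look something like s = timed("test_load", test_load, html), which keeps the accumulated totals in one place.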