[Python-Dev] Better text processing support in py2k?

Fredrik Lundh fredrik@pythonware.com
Thu, 30 Dec 1999 12:05:45 +0100


Tim Peters is back from his vacation:
> > While I don't want to turn Python into Perl, I would like to see
> > it do a better job of what most people probably use the language
> > for.  Here is a very short list of things I think need attention:
> >
> >     1. [*A* clear way to do memory- and time-efficient textfile
> >         input]
> 
> I agree, but unsure how to fix it.  The best way to write this now is
> 
>     # f is some open file object.
>     while 1:
>         lines = f.readlines(BUFSIZE)
>         if not lines:
>             break
>         for line in lines:
>             process(line)
> 
> and it's not something anyone figures out on their own -- or enjoys typing
> or explaining afterwards.
> 
> Perl gets its line-at-a-time speed by peeking and poking C FILE structs
> directly in compiler- and platform-specific ways -- ways that vendors
> *should* have done in their own fgets implementations, but almost never do.
> I have no idea whether it works well with Perl's nascent notions of
> threading, but in the absence of that "the system" doesn't know Perl is
> cheating (i.e., as far as libc+friends are concerned, Perl *is* reading one
> line at a time -- even mixing in C-level ungetc calls works (well, sometimes
> <0.1 wink -- they don't always peek and poke enough fields>)).
> 
> The Python QIO extension module is much easier to port but less compatible
> (it doesn't use stdio, so QIO-opened files don't play well with others) and
> slower (although that's likely repairable -- he's got two passes over the
> buffer where one hairier pass should suffice).

we have something called SIO which uses memory mapping
where possible, and just a more aggressive read-ahead for
other cases.  on a windows box, a traditional while/readline
loop runs 3-5 times faster than before.  with SRE instead of
re, a while/readline/match loop runs up to 10 times faster
than before.

note that this is without *any* changes to the Python
source code...
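
for reference, the two loop shapes being compared look roughly like the sketch below. it is written in present-day Python and is not SIO itself; the helper names process_readline and process_chunked, and the BUFSIZE value, are made up for illustration (the thread does not give a value).

```python
import io

BUFSIZE = 64 * 1024  # hypothetical size hint; not a value from the thread


def process_readline(f, process):
    """Traditional while/readline loop: one method call per line."""
    while True:
        line = f.readline()
        if not line:
            break
        process(line)


def process_chunked(f, process, bufsize=BUFSIZE):
    """Chunked idiom from the quoted text: readlines() with a size hint."""
    while True:
        lines = f.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)


# Small in-memory "file" so the sketch runs without touching disk.
sample = "".join("line %d\n" % i for i in range(1000))

seen_a, seen_b = [], []
process_readline(io.StringIO(sample), seen_a.append)
process_chunked(io.StringIO(sample), seen_b.append, bufsize=256)
```

both loops hand the same lines to process(); they differ only in how many file-object calls it takes to get there.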

> >     2. The re module needs to be sped up, if not to catch up with
> >        Perl, then to catch up with the deprecated regex module.
> 
> The irony here is that the re engine is very often unboundedly faster than
> the regex engine -- provided you're chewing over large strings.  Some tests
> /F ran showed that the length-independent *overhead* of invoking re is about
> 10x higher than for regex.  Presumably the bulk of that is due to re.py,
> i.e. that you get to the re engine via going thru Python layers on your way
> in and out, while regex was pure C.

I've attached some old benchmarks.  I think the current code
base is a bit faster, but you get the idea.
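
one way to look at the length-independent overhead on its own is to time a search against a very short string, so that the per-call cost dominates the matching work. the sketch below does that with today's re module (not the 1999 re, regex, or sre builds measured in the attachment); the pattern, subject string, and call count are arbitrary choices for illustration.

```python
import re
import timeit

PATTERN = "Python|Perl"
SHORT = "Perl"  # tiny subject, so per-call overhead dominates

compiled = re.compile(PATTERN)


def via_module():
    # Goes through the module-level helper (pattern lookup on each call).
    return re.search(PATTERN, SHORT)


def via_compiled():
    # Calls the already-compiled pattern object directly.
    return compiled.search(SHORT)


N = 10000  # arbitrary number of calls
t_module = timeit.timeit(via_module, number=N)
t_compiled = timeit.timeit(via_compiled, number=N)
print("re.search:       %.6f s for %d calls" % (t_module, N))
print("compiled.search: %.6f s for %d calls" % (t_compiled, N))
```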

> In any case, /F is working on a new engine (for Unicode), and I believe he
> has this all well in hand.

with a little luck, the new module will replace both pcre
and regex...

not to mention that it's fairly easy to write your own front-
end to the matching engine -- the expression parser and the
compiler are both written in good old python.

</F>

$ python sre_bench.py
          0     5    50   250  1000  5000 25000
----- ----- ----- ----- ----- ----- ----- -----
search for Python|Perl in Perl ->
sre8  0.007 0.008 0.010 0.010 0.020 0.073 0.349
sre16 0.007 0.007 0.008 0.010 0.020 0.075 0.353
re    0.097 0.097 0.101 0.103 0.118 0.175 0.480
regex 0.007 0.007 0.009 0.020 0.059 0.271 1.320

search for (Python|Perl) in Perl ->
sre8  0.007 0.007 0.007 0.010 0.020 0.074 0.344
sre16 0.007 0.007 0.008 0.010 0.020 0.074 0.347
re    0.110 0.104 0.111 0.115 0.125 0.184 0.559
regex 0.006 0.006 0.009 0.019 0.057 0.285 1.432

search for Python in Python ->
sre8  0.007 0.007 0.007 0.011 0.021 0.072 0.387
sre16 0.007 0.007 0.008 0.010 0.022 0.082 0.365
re    0.107 0.097 0.105 0.102 0.118 0.175 0.511
regex 0.009 0.008 0.010 0.018 0.036 0.139 0.708

search for .*Python in Python ->
sre8  0.008 0.007 0.008 0.011 0.021 0.079 0.379
sre16 0.008 0.008 0.008 0.011 0.022 0.075 0.402
re    0.102 0.108 0.119 0.183 0.400 1.545 7.284
regex 0.013 0.019 0.072 0.318 1.231 8.035 45.366

search for .*Python.* in Python ->
sre8  0.008 0.008 0.008 0.011 0.021 0.080 0.383
sre16 0.008 0.008 0.008 0.011 0.021 0.079 0.395
re    0.103 0.108 0.119 0.184 0.418 1.685 8.378
regex 0.013 0.020 0.073 0.326 1.264 9.961 46.511

search for .*(Python) in Python ->
sre8  0.007 0.008 0.008 0.011 0.021 0.077 0.378
sre16 0.007 0.008 0.008 0.011 0.021 0.077 0.444
re    0.108 0.107 0.134 0.240 0.637 2.765 13.395
regex 0.026 0.112 3.820 87.322 (skipped)

search for .*P.*y.*t.*h.*o.*n.* in Python ->
sre8  0.010 0.010 0.014 0.031 0.093 0.419 2.212
sre16 0.010 0.011 0.014 0.030 0.093 0.419 2.292
re    0.112 0.121 0.195 0.521 1.747 8.298 40.877
regex 0.026 0.048 0.248 1.148 4.550 24.720 ...

(searching for patterns in padded strings; sre8
is the sre engine compiled for 8-bit characters,
sre16 is the same engine compiled for 16-bit
characters)
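
sre_bench.py itself is not included here. a minimal sketch of a harness in the same spirit, using today's re module, might look like the following. it assumes the column headings are padding sizes, that the padding goes in front of the target, and that "-" is the padding character; none of that is stated above, and padded(), bench(), and the repeat count are hypothetical.

```python
import re
import time

SIZES = [0, 5, 50, 250, 1000, 5000, 25000]  # column headings from the table


def padded(target, size, pad="-"):
    # Assumption: padding is placed in front of the target string.
    return pad * size + target


def bench(pattern, target, sizes=SIZES, repeat=100):
    """Time `repeat` searches for `pattern` at each padding size."""
    compiled = re.compile(pattern)
    results = []
    for size in sizes:
        text = padded(target, size)
        start = time.perf_counter()
        for _ in range(repeat):
            match = compiled.search(text)
        elapsed = time.perf_counter() - start
        results.append((size, elapsed, match is not None))
    return results


rows = bench(r".*Python", "Python", repeat=10)
for size, elapsed, found in rows:
    print("%6d %.6f %s" % (size, elapsed, found))
```

timings from this sketch are not comparable to the figures above; it only shows the shape of the measurement.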