Multiline regex
Steven D'Aprano
steve-REMOVE-THIS at cybersource.com.au
Wed Jul 21 20:25:58 EDT 2010
On Wed, 21 Jul 2010 10:06:14 -0500, Brandon Harris wrote:
> what do you mean by slurp the entire file? I'm trying to use regular
> expressions because line by line parsing will be too slow. An example
> file would have somewhere in the realm of 6 million lines of code.
And you think trying to run a regex over all 6 million lines at once will
be faster? I think you're going to be horribly, horribly disappointed.
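For reference, the slurp-and-regex approach being proposed would look something like the sketch below. The 'createNode' anchor is taken from the examples in this thread; the pattern and function name are hypothetical, just to make the comparison concrete. Note that this holds the entire file in memory and the regex engine still has to scan every character, so there's no free lunch:

```python
import re

# Hypothetical sketch of the slurp-and-regex approach: read the whole
# file into one string, then split it into blocks with a single regex.
# Each block runs from one 'createNode' up to (but not including) the
# next one, or to the end of the string.
BLOCK_RE = re.compile(r'createNode.*?(?=createNode|\Z)',
                      re.DOTALL | re.IGNORECASE)

def grab_blocks_regex(text):
    return BLOCK_RE.findall(text)

text = ('createNode transform -n "top"\n'
        '    setAttr ".v" 0\n'
        'createNode camera -n "cam"\n'
        '    setAttr ".fl" 35\n')
blocks = grab_blocks_regex(text)
# Two blocks, one per createNode line.
```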
And then on Wed, 21 Jul 2010 10:42:11 -0500, Brandon Harris wrote:
> I could make it that simple, but that is also incredibly slow and on a
> file with several million lines, it takes somewhere in the league of
> half an hour to grab all the data. I need this to grab data from many,
> many files and return the data quickly.
What do you mean "grab" all the data? If all you mean is read the file,
then 30 minutes to read ~ 100MB of data is incredibly slow and you're
probably doing something wrong, or you're reading it over a broken link
with very high packet loss, or something.
If you mean read the data AND parse it, then whether that is "incredibly
slow" or "amazingly fast" depends entirely on how complicated your parser
needs to be.
If *all* you mean is "read the file and group the lines, for later
processing", then I would expect it to take well under a minute to group
millions of lines. Here's a simulation I ran, using 2001000 lines of text
based on the examples you gave. It grabs the blocks, as required, but
does no further parsing of them.
def merge(lines):
    """Join multiple lines into a single block."""
    accumulator = []
    for line in lines:
        if line.lower().startswith('createnode'):
            if accumulator:
                yield ''.join(accumulator)
            accumulator = []
        accumulator.append(line)
    if accumulator:
        yield ''.join(accumulator)
def test():
    import time
    t = time.time()
    count = 0
    f = open('/steve/test.junk')
    for block in merge(f):
        # do some make-work
        n = sum([1 for c in block if c in '1234567890'])
        count += 1
    print "Processed %d blocks from 2M+ lines." % count
    print "Time taken:", time.time() - t, "seconds"
And the result on a low-end PC:
>>> test()
Processed 1000 blocks from 2M+ lines.
Time taken: 17.4497909546 seconds
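Since merge() only needs an iterable of lines, the grouping logic can also be sanity-checked without touching the filesystem. The sample lines below are made up in the style of the examples from this thread (the generator is repeated here so the snippet stands alone):

```python
def merge(lines):
    """Join multiple lines into a single block (same generator as above)."""
    accumulator = []
    for line in lines:
        if line.lower().startswith('createnode'):
            if accumulator:
                yield ''.join(accumulator)
            accumulator = []
        accumulator.append(line)
    if accumulator:
        yield ''.join(accumulator)

# Hypothetical sample lines, in the style of the thread's examples.
sample = [
    'createNode transform -n "top"\n',
    '    setAttr ".v" 0\n',
    'createNode camera -n "cam"\n',
    '    setAttr ".fl" 35\n',
]
blocks = list(merge(sample))
# Two blocks, one starting at each createNode line.
```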
--
Steven