Need a specific sort of string modification. Can someone help?

Roy Smith roy at panix.com
Sun Jan 6 12:28:55 EST 2013


In article <roy-103D43.15470305012013 at news.panix.com>,
 Roy Smith <roy at panix.com> wrote:

> It's rare to find applications these days that are truly CPU bound.  
> Once you've used some reasonable algorithm, i.e. not done anything in 
> O(n^2) that could have been done in O(n) or O(n log n), you will more 
> often run up against I/O speed, database speed, network latency, memory 
> exhaustion, or some such as the reason your code is too slow.

Well, I just found a counter-example :-)

I've been doing some log analysis.  It's been taking a grovelingly long 
time, so I decided to fire up the profiler and see what's taking so 
long.  I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots 
might be (looking up IP addresses in the geolocation database, or 
producing some pretty pictures using matplotlib).  It was just a matter 
of figuring out which it was.

As with most attempts to out-guess the profiler, I was totally, 
absolutely, and embarrassingly wrong.

It turns out we were spending most of the time parsing timestamps!  
Since there's no convenient way (I don't consider strptime() to be 
convenient) to parse isoformat strings in the standard library, our 
habit has been to use the oh-so-simple parser from the third-party 
dateutil package.  Well, it turns out that's slow as all get-out 
(probably because it's trying to be smart about auto-recognizing 
formats).  For the test I ran (on a few percent of the real data), we 
spent 90 seconds in parse().

OK, so I dragged out the strptime() docs and built the stupid format 
string (%Y-%m-%dT%H:%M:%S+00:00).  That got us down to 25 seconds in 
strptime().
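
For illustration, here is a minimal sketch of the strptime() call with 
that format string.  The sample timestamp is made up (it is not from the 
real logs), and the literal "+00:00" in the format means this only 
matches UTC timestamps and returns a naive datetime.

```python
from datetime import datetime

# Format string from the post. The trailing "+00:00" is matched literally,
# so only UTC timestamps parse and the result carries no tzinfo.
ISO_UTC_FORMAT = "%Y-%m-%dT%H:%M:%S+00:00"

# Hypothetical sample timestamp, not taken from the real log data.
sample = "2013-01-06T12:28:55+00:00"

parsed = datetime.strptime(sample, ISO_UTC_FORMAT)
print(parsed.hour, parsed.minute)
```

Any timestamp with a different offset would raise ValueError under this 
format, which is fine only if the logs are known to be all UTC.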

But I could also see it was spending a significant amount of time in 
routines that looked like they were computing things we didn't need, 
like the day of the week.  For what I was doing, we only really needed 
the hour and minute.  So I tried:

        t_hour = int(date[11:13])
        t_minute = int(date[14:16])

That got us down to 12 seconds overall (including the geolocation and 
pretty pictures).
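
As a self-contained sketch of the slicing approach: the function name 
and the sample string below are assumptions for illustration, and the 
fixed offsets only work if every timestamp has the exact 
YYYY-MM-DDTHH:MM:SS layout.

```python
def hour_and_minute(date):
    """Return (hour, minute) from a string laid out as YYYY-MM-DDTHH:MM:SS...

    Relies on fixed character positions: hour at [11:13], minute at [14:16].
    No validation is done beyond int() rejecting non-digit text.
    """
    return int(date[11:13]), int(date[14:16])


# Hypothetical sample timestamp, not taken from the real log data.
sample = "2013-01-06T12:28:55+00:00"

t_hour, t_minute = hour_and_minute(sample)
print(t_hour, t_minute)
```

The trade-off is that a malformed line will either raise ValueError from 
int() or silently yield wrong numbers, where strptime() would reject it.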

I think it turns out we never do anything with the hour and minute other 
than print them back out, so just

       t_hour_minute = date[11:16]

would probably be good enough, but I think I'm going to stop where I am 
and declare victory :-)
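
For anyone wanting to check this on their own machine, a rough timeit 
sketch comparing strptime() against the plain slice might look like the 
following.  The helper names and the sample string are hypothetical, and 
the absolute numbers will vary by machine and Python version, so no 
timings are quoted here.

```python
import timeit
from datetime import datetime

# Hypothetical sample timestamp, not taken from the real log data.
sample = "2013-01-06T12:28:55+00:00"
FMT = "%Y-%m-%dT%H:%M:%S+00:00"


def via_strptime(date):
    # Full parse, then format the hour and minute back out as HH:MM.
    parsed = datetime.strptime(date, FMT)
    return "%02d:%02d" % (parsed.hour, parsed.minute)


def via_slice(date):
    # Just pull the HH:MM characters straight out of the string.
    return date[11:16]


# Both approaches should agree on well-formed input.
assert via_strptime(sample) == via_slice(sample)

strptime_seconds = timeit.timeit(lambda: via_strptime(sample), number=10000)
slice_seconds = timeit.timeit(lambda: via_slice(sample), number=10000)
print("strptime: %.4fs  slice: %.4fs" % (strptime_seconds, slice_seconds))
```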


