[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Tue Aug 23 22:10:21 CEST 2005

Hi, I'm working on Argon (http://www.third-bit.com/trac/argon) with Greg
Wilson this summer.

We're having a very strange problem with Python's unicode parsing of source
files. Basically, our CGI script was running extremely slowly on our production
box (a pokey dual-Xeon 3GHz w/ 4GB RAM and 15K SCSI drives). Slow to the tune
of 6-10 seconds per request. I eventually tracked this down to imports of our
source tree; the actual request was completing in 300ms, the rest of the time
was spent in __import__.
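
For anyone who wants to check for the same symptom, here is a minimal sketch of timing a single import from a fresh interpreter. The helper name time_import is made up for illustration, and "json" is only a stand-in for a module from the affected source tree; this is not the measurement method used above (that was gprof).

```python
import importlib
import time


def time_import(module_name):
    """Return the wall-clock seconds spent importing module_name.

    The figure is only meaningful for a module that has not been
    imported yet in this process; a cached module returns almost
    immediately.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # "json" is a placeholder; substitute a module from your own tree.
    print("import took %.3f s" % time_import("json"))
```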

After doing some gprof profiling, I discovered _PyUnicodeUCS2_IsLinebreak was
getting called 51 million times. Our code is 1.2 million characters, so I
hardly think it makes sense to call IsLinebreak over 40 times for each character;
and we're not even importing our entire source tree on every invocation.
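
The per-character ratio behind that complaint can be checked with a couple of lines. This is just the arithmetic on the two figures quoted above; the function name is hypothetical.

```python
def calls_per_char(total_calls, total_chars):
    """Average number of calls made per source character."""
    return total_calls / total_chars


# Figures from the profile: 51 million calls, 1.2 million characters.
ratio = calls_per_char(51_000_000, 1_200_000)
print("%.1f calls per character" % ratio)  # 42.5 calls per character
```

Since only part of the tree is imported on each request, the real ratio for the files actually loaded would be higher than this.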

Our code is a fork of Trac, and each source file originally had this line at
the top:

# -*- coding: iso8859-1 -*-

This made me suspicious, so I removed all of them. The CGI execution time
immediately dropped to ~1 second. gprof revealed that
_PyUnicodeUCS2_IsLinebreak is not called at all anymore.
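
To find which files carry such a declaration, something like the following sketch could be used. It checks the first two lines of a file's text against the pattern given in PEP 263; the function name declared_encoding is made up, and the check is simplified (it does not verify that the first line is itself a comment or blank before honoring the second).

```python
import re

# Encoding-declaration pattern as described in PEP 263.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_text):
    """Return the encoding named in the first two lines, or None."""
    for line in source_text.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


sample = "# -*- coding: iso8859-1 -*-\nimport os\n"
print(declared_encoding(sample))  # iso8859-1
```

Running this over each .py file in a tree would list the files that go through the declared-encoding path on import.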

Now that our code works fast enough, I don't really care about this, but I
thought python-dev might want to know something weird is going on with unicode
splitlines.

I documented my investigation of this problem; if anyone wants further details,
just email me. (I'm not on python-dev)
http://www.third-bit.com/trac/argon/ticket/525

Thanks in advance,
Keir