[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Wed Aug 24 12:16:25 CEST 2005

Walter Dörwald wrote:
> This is caused by the changes to the codecs in 2.4. Basically the codecs 
> no longer rely on C's readline() to do line splitting (which can't work 
> for UTF-16), but do it themselves (via unicode.splitlines()).

That explains why you get any calls to IsLineBreak; it doesn't explain
why you get so many of them.

I investigated this a bit, and one issue seems to be that
StreamReader.readline performs splitlines on the entire input, only to
fetch the first line. It then joins the rest for later processing.
In addition, it also performs splitlines on a single line, just to
strip any trailing line breaks.

The net effect is that, for a file with N lines, IsLineBreak is invoked
up to N*N/2 times per character (at least for the last character).
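
To make the growth concrete, here is a rough model of the pattern described
above. It is not the actual codecs.py code: the function name
naive_readline_scans and the "characters inspected" counter are made up for
illustration, and it assumes each splitlines() call inspects every character
of the string it is given.

```python
def naive_readline_scans(text):
    """Simplified model of the readline pattern described above.

    Returns (lines, scanned), where scanned approximates how many
    characters splitlines() had to inspect in total.
    """
    buffer = text
    lines = []
    scanned = 0
    while buffer:
        # Split the whole remaining buffer just to obtain the first line.
        scanned += len(buffer)
        parts = buffer.splitlines(True)  # True keeps the line endings
        first = parts[0]
        # Join the rest back together for later processing.
        buffer = "".join(parts[1:])
        # Second splitlines() on the single line, only to strip the line end.
        scanned += len(first)
        lines.append(first.splitlines()[0])
    return lines, scanned
```

In this model, doubling the number of lines roughly quadruples the count, so
the total work is quadratic in the size of the buffered input.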

So I think it would be best if Unicode characters exposed a .islinebreak
method (or, failing that, codecs just knew what the line break
characters are in Unicode 3.2), and then codecs would split off
the first line of input itself.
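
A minimal sketch of that second option, assuming the codec simply carries its
own table of line break characters. The names LINE_BREAKS and
split_first_line are hypothetical, and the character set is the one
documented for str.splitlines() in current Python 3, which is an assumption
here rather than the Unicode 3.2 list.

```python
# Assumed set of line boundaries, taken from the str.splitlines() docs.
LINE_BREAKS = frozenset("\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029")


def split_first_line(buffer):
    """Split off only the first line of buffer.

    Returns (line_with_ending, rest). If buffer holds no line break yet,
    returns (None, buffer) so the caller can read more input first.
    """
    for i, ch in enumerate(buffer):
        if ch in LINE_BREAKS:
            end = i + 1
            # Treat "\r\n" as a single line ending.
            if ch == "\r" and buffer[end:end + 1] == "\n":
                end += 1
            return buffer[:end], buffer[end:]
    return None, buffer
```

The scan stops at the first line break, so characters after it are not
inspected on this call. A real implementation would also have to handle a
"\r" at the very end of the buffer, since the matching "\n" may only arrive
with the next chunk.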

>>After doing some gprof profiling, I discovered _PyUnicodeUCS2_IsLinebreak was
>>getting called 51 million times. Our code is 1.2 million characters, so I
>>hardly think it makes sense to call IsLinebreak 50 times for each character;
>>and we're not even importing our entire source tree on every invocation.
> 
> 
> But if you're using CGI, you're importing your source on every 
> invocation.

Well, no. Only the CGI script needs to be parsed every time; all modules
could load off bytecode files.

Which suggests that Keir Mierle doesn't use bytecode files; I think he
should.
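
One way to get bytecode files in place ahead of time is the standard
compileall module. This is only a sketch: the helper name precompile is made
up, and it assumes the account running it can write to the source tree (the
CGI user often cannot, which may be why no bytecode files get written).

```python
import compileall


def precompile(source_dir):
    """Byte-compile every .py file under source_dir.

    Returns True if all files compiled without error.
    """
    # quiet=1 prints only errors, not the name of every file compiled.
    return bool(compileall.compile_dir(str(source_dir), quiet=1))
```

Running this once after each deployment would leave only the CGI script
itself to be parsed on every request.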

Regards,
Martin