[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Wed Aug 24 12:56:58 CEST 2005

M.-A. Lemburg wrote:
> I think it's worthwhile reconsidering this approach for
> character type queries that do not involve a huge number
> of code points.

I would advise against that. I measured both versions
(yours is called PyUnicode_IsLinebreak2 here) with the
following code:

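#include "Python.h"   /* declares _PyUnicode_IsLinebreak */
#include <stdio.h>
#include <time.h>

/* PyUnicode_IsLinebreak2 (your version) has to be declared and
   linked in separately. */

#define REPS 10000000000LL

/* volatile, so the compiler cannot optimize the calls away */
volatile int result;

void unibench(void)
{
  long long i;
  clock_t s1,s2,s3,s4,s5;
  /* f1 = current code, f2 = your version; '(' is not a line break */
  s1 = clock();
  for(i=0;i<REPS;i++)
    result = _PyUnicode_IsLinebreak('(');
  s2 = clock();
  for(i=0;i<REPS;i++)
    result = PyUnicode_IsLinebreak2('(');
  s3 = clock();
  /* '\n' is a line break (labelled CR in the output) */
  for(i=0;i<REPS;i++)
    result = _PyUnicode_IsLinebreak('\n');
  s4 = clock();
  for(i=0;i<REPS;i++)
    result = PyUnicode_IsLinebreak2('\n');
  s5 = clock();
  /* values are clock() ticks; divide by CLOCKS_PER_SEC for seconds */
  printf("f1, (: %d\nf2, (: %d\nf1, CR: %d\nf2, CR: %d\n",
	 (int)(s2-s1),(int)(s3-s2),(int)(s4-s3),(int)(s5-s4));
}

and got these numbers:

f1, (: 13210000
f2, (: 13300000
f1, CR: 13220000
f2, CR: 13250000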

What can be seen is that the performance of the two versions is
nearly identical, with the code currently used being slightly faster.
What can also be seen is that, on my machine, 1e10 calls to
IsLinebreak take 13.2 seconds, or about 1.3 ns per call. So 51 million
calls take about 70 ms.
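
For anyone who wants to redo the extrapolation with their own
measurements, here is a minimal sketch of the arithmetic. The names
estimate_ms and print_estimate are made up for this illustration; the
figures are the ones reported above, and the scaling is assumed to be
linear in the number of calls.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: estimated time in milliseconds for `calls` calls,
   extrapolated linearly from a benchmark in which `bench_calls` calls
   took `bench_seconds` seconds. */
double estimate_ms(double bench_seconds, double bench_calls, double calls)
{
    assert(bench_calls > 0.0);
    return bench_seconds / bench_calls * calls * 1000.0;
}

/* Print the per-call cost and the extrapolated total for the measured
   figures: 13.2 seconds for 1e10 calls, scaled to 51 million calls. */
void print_estimate(void)
{
    double per_call_ns = 13.2 / 1e10 * 1e9;
    double total_ms = estimate_ms(13.2, 1e10, 51e6);

    printf("per call: %.2f ns\n", per_call_ns);
    printf("51 million calls: %.1f ms\n", total_ms);
}
```

Calling print_estimate() from a main function gives roughly 1.3 ns per
call and a bit under 70 ms for the 51 million calls.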

The reported performance problem is more likely in the allocation
of all these splitlines results, and the copying of the same
strings over and over again.
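
To illustrate where that cost would come from, here is a rough sketch
that tallies the allocations and copied bytes of a naive split. It is
not CPython's splitlines implementation: split_and_count and struct
split_stats are hypothetical names, only '\n' is treated as a line
break, and each copy is freed right away because only the tally is of
interest.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tally of the work done by splitting, for illustration. */
struct split_stats {
    size_t allocations;   /* one per line produced */
    size_t bytes_copied;  /* characters copied into new buffers */
};

/* Split `text` (of length `len`) at '\n', copying every line into a
   freshly allocated buffer. The counts are added to `stats`, so calling
   this repeatedly on the same text accumulates the repeated work.
   Returns 0 on success, -1 if an allocation fails. */
int split_and_count(const char *text, size_t len, struct split_stats *stats)
{
    size_t start = 0;
    size_t i;

    assert(text != NULL && stats != NULL);
    for (i = 0; i <= len; i++) {
        if (i == len || text[i] == '\n') {
            size_t n = i - start;
            char *line;

            if (i == len && n == 0)
                break;              /* no empty line after a final '\n' */
            line = malloc(n + 1);
            if (line == NULL)
                return -1;
            memcpy(line, text + start, n);
            line[n] = '\0';
            stats->allocations++;
            stats->bytes_copied += n;
            free(line);
            start = i + 1;
        }
    }
    return 0;
}
```

Every split of the same text pays for one allocation per line plus a
copy of every character, whereas the line break test itself is a
nanosecond-scale check per character.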

Regards,
Martin