[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

Wed Aug 24 14:24:42 CEST 2005

Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
> 
>>I think it's worthwhile reconsidering this approach for
>>character type queries that do no involve a huge number
>>of code points.
> 
> 
> I would advise against that. I measure both versions
> (your version called PyUnicode_IsLinebreak2) with the
> following code
> 
> volatile int result;
> void unibench()
> {
> #define REPS 10000000000LL
>   long long i;
>   clock_t s1,s2,s3,s4,s5;
>   s1 = clock();
>   for(i=0;i<REPS;i++)
>     result = _PyUnicode_IsLinebreak('(');
>   s2 = clock();
>   for(i=0;i<REPS;i++)
>     result = PyUnicode_IsLinebreak2('(');
>   s3 = clock();
>   for(i=0;i<REPS;i++)
>     result = _PyUnicode_IsLinebreak('\n');
>   s4 = clock();
>   for(i=0;i<REPS;i++)
>     result = PyUnicode_IsLinebreak2('\n');
>   s5 = clock();
>   printf("f1, (: %d\nf2, (: %d\nf1, CR: %d\n, f2, CR: %d\n",
> 	 (int)(s2-s1),(int)(s3-s2),(int)(s4-s3),(int)(s5-s4));
> }
> 
> and got those numbers
> 
> f1, (: 13210000
> f2, (: 13300000
> f1, CR: 13220000
> , f2, CR: 13250000
> 
> What can be seen is that performance the two versions is nearly
> identical, with the code currently used being slightly better.
> What can also be seen is that, on my machine, 1e10 calls to
> IsLinebreak take 13.2 seconds. So 51  Mio calls take about 70ms.

Your test is somewhat biased: the current solution
works using type records, so it has to swap in a new
record for each character you test. In you benchmark,
the same character is tested over and over again
and the type record likely already stored in the
CPU cache.

The .splitlines() routine itself calls the above
function for each and every character in the string,
so quite a few of these type records have to be
looked up.

Here's a version that uses os.py as basis:

#include <stdlib.h>
#include <time.h>
#include "Python.h"

int _PyUnicode_IsLinebreak16(register const Py_UNICODE ch)
{
    switch (ch) {
    case 0x000A: /* LINE FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x001C: /* FILE SEPARATOR */
    case 0x001D: /* GROUP SEPARATOR */
    case 0x001E: /* RECORD SEPARATOR */
    case 0x0085: /* NEXT LINE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
	return 1;
    default:
	return 0;
    }
}

#define REPS 10000
#define BUFFERSIZE 30000

int main(void)
{
    long i, j;
    clock_t s1,s2,s3;
    char *buffer;
    FILE *datafile;
    long filelen;
    int result;

    datafile = fopen("os.py", "rb");
    if (datafile == NULL) {
	printf("could not find os.py\n");
	return -1;
    }
    buffer = (char *)malloc(BUFFERSIZE);
    filelen = fread(buffer, 1, BUFFERSIZE, datafile);
    printf("filelen=%li bytes\n", filelen);

    s1 = clock();

    /* Python 2.4 */
    for(i = 0; i < REPS; i++)
	for (j = 0; j < filelen; j++)
	    result = _PyUnicode_IsLinebreak((Py_UNICODE)buffer[j]);
    s2 = clock();

    /* Python 1.6 */
    for(i = 0; i < REPS; i++)
	for (j = 0; j < filelen; j++)
	    result = _PyUnicode_IsLinebreak16((Py_UNICODE)buffer[j]);
    s3 = clock();

    printf("2.4: %d\n"
	   "1.6: %d\n",
	   (int)(s2-s1),
	   (int)(s3-s2));
    return 0;
}

Output, compiled with -O3:

filelen=23147 bytes
2.4: 2570000
1.6: 1230000

That's a factor 2.

> The reported performance problem is more likely in the allocation
> of all these splitlines results, and the copying of the same
> strings over and over again.

True.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 23 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::