[Python-Dev] Syntax for parsing tuples/lists into C arrays

Thu Jun 25 17:40:58 CEST 2009

The current C API does not support directly parsing into C arrays. Parsing tuples into C arrays can be greatly facilitated with the introduction of a natural extension of the current parsing syntax. I'd like to propose a syntax extension to facilitate that kind of tasks, and would appreciate your feedbacks. The extension comes from the way strings are parsed in current C API: 

{
const char * s; int i;
PyArg_ParseTyple(args, "s#", &s, &i);
}

Consider for a moment the string as a sequence of char's, which by the example above is parsed into an array of char's. Note that in python, a tuple of some type (e.g. int's), is also a sequence of that type, This perspective provides a natural way to extend the the use of "#" symbol -- what is missing is how to signal a tuple as a sequence, without interfering with current fixed-length tuple parsing syntax. To signal a tuple as a sequence of a given type of objects, "(type:)" can be used. For example, "(i:)" signals a tuple of integers as array of integers. The ":" is chosen to denote possible repetition of the whole pattern inside the parentheses. 

Now consider 3 different scenarios of parsing tuples into arrays.

Scenario (1) Unkown size. Here are some examples:

{
//parsing a tuple and its size
int i, *t;
PyArg_ParseTuple(args, "(i:)#", &t, &i);
}

{
//parsing a matrix of i rows and j columns:
int i, j, **t;
PyArg_ParseTuple(args, "((i:):)##", &t, &i, &j);
}

Here is an informal explanation of "((i:):)##": the '#' can be seen as a peeling operator, which peels away the outer most "(...:)" that denote a "free sequence" (i.e. sequence of unknown size) with some pattern "...".

For types like 's' or its similar types, pretend as if there is an invisible pair of parentheses around it. For example, "(s:)##" matches a tuple of strings of some fixed length. Here the first '#' peels away the parentheses, leaving only "s#", which is equivalent to "(*char*:)#" in concept, where "*char*" denote the character type (not exist in Python). As the second '#' is outside the tuple, it is shared by all strings inside, so they _must_ have the same length.

If there is nothing to peel for a '#', it is considered a dangling '#', an error. Now consider some more interesting examples:

{
int i, *t, *v;
PyArg_ParseTuple(args, "(ii:)#", &t, &v, &i);
//args=(1,2,3,4,5,6)
//t={1,3,5}
//v={2,4,6}
//i=3
}

In that example, a pattern is provided inside the sequence "(ii:)", which can repeat itself multiple times. The repeated times is extracted by the '#' symbol. Below is another example:

{
int i,j, *t, **v;
PyArg_ParseTuple(args, "(i(i:):)##", &t, &v, &i, &j);
//args=(1,(3,),5,(7,))
//t={1,5}
//v={{3},{7}}
//i=2, j=1
}

And another one:

{
int i,j, **t, **v;
PyArg_ParseTuple(args, "((i:)(i:):)##", &t, &v, &i, &j);
//args=((1,),(3,),(5,),(7,))
//t={{1},{5}}
//v={{3},{7}}
//i=2, j=1
}

Note the second '#' is applied to both the free sequences inside the outer tuple. This indicates that all the inside tuples must be of the same length. What if they have their own lengths? We provide a separate matching '#' for each inside tuple:

{
int i,j,k, **t, **v;
PyArg_ParseTuple(args, "((i:)(i:):)#(##)", &t, &v, &i, &j,&k);
//args=((1,),(3,4),(5,),(7,8))
//t={{1},{5}}
//v={{3,4},{7,8}}
//i=2, j=1, k=2
}

The syntax can also deal with more complicated situations, such as an irregular 2-dimensional array, where each row has a different row length:

{
int i, *j, **t;
PyArg_ParseTuple(args, "((i:)#:)#", &t, &j, &i);
for(a=0; a)
//t={{{1,2},{1,2}},{{1,2}}}
//i=2, k=2
//j={2,1}
}

This syntax can be used for parsing a tuple of strings of different lengths:

{
int i, *j;
const char ** s;
PyArg_ParseTuple(args, "(s#:)#", &s, &j, &i);
//args=("ab", "cde")
//s={"ab", "cde"}
//j={2, 3}
//i=2
}

Scenario (2) If you know the size of a tuple before hand, use this syntax: "(*type*integer)", meaning there are exactly 'integer' number of *type* objects in the tuple.

{
int *t;
//
PyArg_ParseTuple(args, "(i5)", &t);
//args = (1,2,3,4,5)
//t = {1,2,3,4,5}
}

A complicated case:
{
int **t;
PyArg_ParseTuple(args, "((i5)2)", &t);
//args=((1,2,3,4,5),(5,6,7,8,9))
//t={{1,2,3,4,5},{5,6,7,8,9}}
}

Another complicated case:

{
int **t, **v, i;
PyArg_ParseTuple(args, "(i3i2:)#", &t,&v,&i);
//args=(1,2,3,4,5,6,7,8,9,10)
//t={{1,2,3},{6,7,8}}
//v={{4,5},{9,10}}
//i=2
}

One more thing: the difference between "(ii)" and "(i2)" is that the former yields two single integers, while the latter gives an array of length two.

Scenario (3) Concatenating arrays while parsing. Sometimes you would like to have a matrix parsed into a single array. We can use a "(+" to signal that the arrays parsed from the elements in that tuple should be concatenated into a single array.

{
int i,j, *t;
PyArg_ParseTuple(args, "(+(i))##", &t, &i, &j);
//args=((1,2,3),(4,5,6))
//t={1,2,3,4,5,6}
//i=2, j=3
}

Another example:

{
int i, *t;
PyArg_ParseTuple(args, "(+(i2))#", &t, &i);
}

Still another example, with strings:
{
int i, *j;
const char * t; //string
PyArg_ParseTuple(args, "(+(+s#2))#", &t, &j, &i);
//args=(("ab","c"),("d","efg"))
//t="abcdefg"
//i=2
////notice the size arrays are concatenated similarly
//j={2,1,1,3} //without '+': j={{2,1},{1,3}}

const char ** sa; 
//The '*' in place of '#' to discard the size info
PyArg_ParseTuple(args, "(+(s*2))", &sa);
//args=(("ab","c"),("d","efg"))
//sa = {"ab","c","d","efg"}

PyArg_ParseTuple(args, "((+s*2))", &sa);
//args=(("ab","c"),("d","efg"))
//sa = { "abc", "defg" }
}

And finally, an example with multi-item pattern:

{
int *t, *v, i;
PyArg_ParseTuple(args, "(+i1i2:)#", &t, &v, &i);
//args=(1,2,3,4,5,6)
//t={1,4}
//v={2,3,5,6}
//i=2
}

Epilogue:

This syntax can be applied to lists: just use "[...:]" instead of "(...:)". It can also be used for building tuples or lists out of arrays.

That's it. Comments?

Yingjie