[Cython] Specializing str methods?

Stefan Behnel stefan_ml at behnel.de
Thu May 19 21:35:17 CEST 2011


Hi,

the right place to discuss this is the cython-devel mailing list.

John Ehresman, 19.05.2011 18:17:
> On 4/13/11 11:23 AM, Stefan Behnel wrote:
>> You can add BuiltinMethod entries for the "str" type in Builtin.py and
>> map them to a function name like "__Pyx_Str_PyString_WhatEver()". Then,
>> define "utility_code" code blocks for each of the methods that #defines
>> that name to the appropriate PyString_*() or PyUnicode_*() C-API
>> function, depending on the Python version it runs with. There are a lot
>> of examples for that in Builtin.py already.
>
> I've finally got back to this and have the start of an implementation. I've
> attached a diff, though I'm not sure what the preferred workflow is. Should
> I push to github and / or create a bug in trac?

It's best to create a pull request on github, potentially accompanied by a 
request for review on the developer mailing list. We get a notification 
about pull requests, but it's still better to give us an idea in what state 
you think your patch is and what kind of review you need. The review can 
then happen by commenting on the changes directly in github.


> I'll probably add more
> methods, but wanted to get feedback before doing so.

Certainly the right decision.


> I also intend to contribute other patches in the future.

Please do. :)


> One thing that I didn't expect is that I originally declared the method as
> BuiltinMethod("startswith", "TT", "O", "_Pyx_StrStartsWith",
> utility_code=str_startswith_utility_code),
> and expected the specialization to only be used when both self and arg
> where a str, but instead cython always used the specialization and added a
> check for arg being a str before the specialization was called. I ended up
> writing a function that works for both unicode & bytes self and an object arg.

You don't really have to care about the type of the argument when applying 
this optimisation. As long as it's clear that the object that provides the 
method is a "str" (although it's potentially None even then!), it's clear 
what the method will do when being called, even if we don't know the 
argument type at compile time. Just provide a fallback, as you did anyway.

I'll comment on your patch below.

diff --git a/Cython/Compiler/Builtin.py b/Cython/Compiler/Builtin.py
+str_startswith_utility_code = UtilityCode(
+proto="""


"proto" is for the forward declaration of the function signature. Except 
for very simple cases (short macros) the implementation goes into "impl".


+static PyObject* _Pyx_StrStartsWith(PyObject* self, PyObject* arg)
+{
+  PyObject* ret;
+
+  if ( PyBytes_CheckExact(self) && PyBytes_CheckExact(arg) ) {
+    if ( PyBytes_GET_SIZE(self) < PyBytes_GET_SIZE(arg) )
+      ret = Py_False;
+    else if ( memcmp(PyBytes_AS_STRING(self), PyBytes_AS_STRING(arg),
+                     PyBytes_GET_SIZE(arg)) == 0 )
+      ret = Py_True;
+    else
+      ret = Py_False;
+    Py_INCREF(ret);
+    return ret;
+  }
+
+  if ( PyUnicode_CheckExact(self) && PyUnicode_CheckExact(arg) ) {
+    if ( PyUnicode_GET_SIZE(self) < PyUnicode_GET_SIZE(arg) )
+      ret = Py_False;
+    else if ( memcmp(PyUnicode_AS_UNICODE(self), PyUnicode_AS_UNICODE(arg),
+                     PyUnicode_GET_DATA_SIZE(arg)) == 0 )
+      ret = Py_True;
+    else
+      ret = Py_False;
+    Py_INCREF(ret);
+    return ret;
+  }
+
+  return PyObject_CallMethod(self, "startswith", "O", arg);


I'd split this up into two functions that can be applied to different types 
independently, and wrap it in a function that calls either one depending on 
PY_MAJOR_VERSION. After all, "str" is *known* to be "bytes" in Py2 and 
"unicode" in Py3.


@@ -522,8 +557,12 @@ builtin_types_table = [
                                      ]),

      ("bytes",   "PyBytes_Type",    []),
-    ("str",     "PyString_Type",   []),
+    ("str",     "PyString_Type",   [BuiltinMethod("startswith", "TO", "O", 
"_Pyx_StrStartsWith",
+ 
utility_code=str_startswith_utility_code),
+                                    ]),


It's better to return a "bint" to avoid returning Python objects for 
True/False. The type character for that is "b". There's a 
"__Pyx_PyObject_IsTrue(obj)" macro that you can use to convert the fallback 
return value. It's usually also ok to copy error handling code from 
CPython, and to raise the exception directly without going into the 
fallback case.


      ("unicode", "PyUnicode_Type",  [BuiltinMethod("join",  "TO",   "T", 
"PyUnicode_Join"),
+                                    BuiltinMethod("startswith", "TO", "O", 
"_Pyx_StrStartsWith",
+ 
utility_code=str_startswith_utility_code),
                                      ]),


This is already optimised in Cython/Compiler/Optimize.py ("tailmatch"), 
basically because it allows different numbers of arguments that need to be 
dealt with. It may be worth going the same route for "str". The decision 
usually depends on how complex the code transformation is. The method table 
in Builtins is clearly limited.



diff --git a/tests/run/stropts.pyx b/tests/run/stropts.pyx
@@ -0,0 +1,84 @@
+cimport cython
+
+ at cython.test_assert_path_exists(
+    '//AttributeNode[@member="_Pyx_StrStartsWith"]')
+def str_startswith_str():
+    """
+    >>> str_startswith_str()
+    True
+    """
+
+    cdef str a
+    cdef str b
+
+    a = ''
+    b = ''
+    return a.startswith(b)
[...]

I usually just implement one or a few test functions and use the doctest to 
pass in different values.

Stefan


More information about the cython-devel mailing list