[issue7330] PyUnicode_FromFormat segfault

Fri Feb 18 08:36:12 CET 2011

STINNER Victor <victor.stinner at haypocalc.com> added the comment:

> > It looks like your patch fixes #10829: you should add tests for that, you can just reuse the tests of my patch (attached to #10829).
> 
> Sorry, but I think my patch doesn't fix #10829.

Ah ok, so don't add failing tests :-)

> > You should also avoid the creation of a temporary unicode object (it can be slow if precision is large) using PySequence_GetSlice(). Py_UNICODE_COPY() does already truncate the string because you can pass an arbitrary length.
> 
> In order to use Py_UNICODE_COPY, I have to create a unicode object with required length first.

No you don't. You can copy a substring of the input string with
Py_UNICODE_COPY: just pass a smaller length.

> > With your patch, "%.200s" truncates the input string to 200 *characters*, but I think that it should truncate to 200 *bytes*, as printf does.
> 
> Sorry, I don't understand. The result of PyUnicode_FromFormatV() is a unicode object. Then how to truncate to 200 *bytes*?

You can truncate the input char* on the call to PyUnicode_DecodeUTF8:
pass a size smaller than strlen(s).

case 's':
{
    /* UTF-8 */
    const char *s = va_arg(count, const char*);
    PyObject *str = PyUnicode_DecodeUTF8(s, strlen(s), "replace");
    if (!str)
        goto fail;
    n += PyUnicode_GET_SIZE(str);
    /* Remember the str and switch to the next slot */
    *callresult++ = str;
    break;
}

I don't know if we should truncate to a number of bytes, or a number of
characters.

> > I don't like this change because I hate having to compute manually strings length. It should that it would be easier if you format directly strings with width and precision at step 3, instead of doing it at step 4: so you can just read the length of the formatted string, and it avoids having to handle width/precision in two steps (which may be inconsistent :-/).
> 
> Do you mean combine step 3 and step 4 together? Currently step 3 is just to compute the biggest width value and step 4 is to compute exact width and do the real format work. Only by doing real format we can get the exact width of a string. So I have to compute each width twice in both step 3 and step 4. Is combining the two steps in to one a good idea?

"Do you mean combine step 3 and step 4 together?"

Yes, but I am no more sure that it is the right thing to do.

> > Can you add tests for "%.s"? I would like to know if "%.s" is different than "%s" :-)
> 
> Err, '%.s' causes unexpected result both with and without my patch. Maybe it's still another bug?

If the fix (always have the same behaviour) is short, it would be nice
to include it in your patch.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7330>
_______________________________________