[Cython] File encoding issue

Mon Nov 7 20:44:32 CET 2011

2011/11/7 Stefan Behnel <stefan_ml at behnel.de>:
> Vitja Makarov, 07.11.2011 19:28:
>>
>> 2011/11/6 Stefan Behnel:
>>>
>>> Vitja Makarov, 06.11.2011 18:10:
>>>>
>>>> When a file encoding is specified, Cython generates two PyObject entries
>>>> for string constants: one for the variable name and one for the string
>>>> constant.
>>>
>>> That's because the content may actually become different after decoding,
>>> even if the encoded byte sequence is identical. Note that decoding is only
>>> done in Py3. In Py2, the byte sequence is used, so both values are
>>> identical.
>>
>> If they are identical after decoding, isn't it better to have only
>> one of them?
>
> Well, yes. That's not trivial, though, because the decision is taken at C
> compile time. And the benefit tends to be negligible, because this case is
> really rare and the affected strings tend to be quite short.
>
>
>>>> Here is a minimal example:
>>>> $ cat cplus.pyx
>>>> # -*- coding: koi8-r -*-
>>>> wtf = 'wtf'
>>>>
>>>> It generates the following code:
>>>>
>>>> /* Implementation of 'cplus' */
>>>> static char __pyx_k__wtf[] = "wtf";
>>>> static char __pyx_k____main__[] = "__main__";
>>>> static char __pyx_k____test__[] = "__test__";
>>>> static PyObject *__pyx_n_s____main__;
>>>> static PyObject *__pyx_n_s____test__;
>>>> static PyObject *__pyx_n_s__wtf;
>>>> static PyObject *__pyx_n_s__wtf;
>>>>
>>>> ...
>>>>
>>>> static __Pyx_StringTabEntry __pyx_string_tab[] = {
>>>>   {&__pyx_n_s____main__, __pyx_k____main__, sizeof(__pyx_k____main__), 0, 0, 1, 1},
>>>>   {&__pyx_n_s____test__, __pyx_k____test__, sizeof(__pyx_k____test__), 0, 0, 1, 1},
>>>>   {&__pyx_n_s__wtf, __pyx_k__wtf, sizeof(__pyx_k__wtf), "koi8-r", 0, 1, 1},
>>>>   {&__pyx_n_s__wtf, __pyx_k__wtf, sizeof(__pyx_k__wtf), 0, 0, 1, 1},
>>>>   {0, 0, 0, 0, 0, 0, 0}
>>>> };
>>>
>>> Both Python object variables should have different cnames.
>>
>> What about adding an encoding suffix?
>
> Yes, I think that would fix it, although it could be a bit misleading when
> reading the C code with a Py3 context in mind. But using a counter doesn't
> make it very readable, either.
>
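
For reference, the point quoted above (identical bytes can decode to different content) can be checked with a small Python 3 sketch. It only shows general codec behaviour, not Cython's string table handling, and the variable names are made up for illustration:

```python
# Pure-ASCII bytes decode to the same text under either codec, so for a
# string like 'wtf' the two generated entries would end up equal in Py3.
ascii_bytes = b"wtf"
same_for_ascii = ascii_bytes.decode("koi8-r") == ascii_bytes.decode("latin-1")

# A byte above 0x7f maps to a different character depending on the codec,
# so in general the decoded value can differ from the undecoded one.
high_byte = b"\xd7"
same_for_high_byte = high_byte.decode("koi8-r") == high_byte.decode("latin-1")

print(same_for_ascii, same_for_high_byte)
```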

Ok.

I've fixed it here: https://github.com/vitek/cython/compare/file_encoding_T770

Now it produces the following identifiers:

static PyObject *__pyx_n_s__wtf;
static PyObject *__pyx_n_s_koi8r__wtf;
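
The naming scheme can be sketched roughly as below. This is not the code from the branch: string_cname is a hypothetical helper, and the rule for turning "koi8-r" into "koi8r" (dropping every non-alphanumeric character) is an assumption inferred from the two identifiers above. It also assumes the string value is already a valid C identifier fragment.

```python
import re

# Prefix inferred from the identifiers shown in the generated C code.
PREFIX = "__pyx_n_s"


def string_cname(value, encoding=None):
    """Hypothetical helper: build a cname for an interned string constant."""
    suffix = ""
    if encoding:
        # Assumed rule: keep only characters valid in a C identifier,
        # e.g. "koi8-r" -> "koi8r".
        suffix = "_" + re.sub(r"[^A-Za-z0-9]", "", encoding)
    return PREFIX + suffix + "__" + value


print(string_cname("wtf"))
print(string_cname("wtf", "koi8-r"))
```

With this, the plain and the encoded entry for 'wtf' no longer share a cname.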

-- 
vitja.