[C++-sig] Some thoughts on py3k support

David Abrahams dave at boostpro.com
Wed Apr 8 17:22:11 CEST 2009


on Wed Mar 18 2009, "Niall Douglas" <s_sourceforge-AT-nedprod.com> wrote:

> On 18 Mar 2009 at 2:07, Haoyu Bai wrote:
>
>> According to the current behavior of Boost.Python converters, the
>> wrapped function in Python 3 will return a b"Hello" (which is a bytes
>> object but not a string). So code like this will broken:
>> 
>> if "Hello" == hello(): ...
>> 
>> Because string object "Hello" is not equal to bytes object b"Hello"
>> returned by hello(). We may change the behavior of converter to return
>> a unicode string in Python 3, that would keep most of existing code
>> compatible. For code that really does need a byte string returned, a
>> new converter can be explicitly specified.
>
> One shouldn't be doing such a comparison anyway IMHO, though the 
> idiom of equivalence between C++ immutable strings and python 
> immutable strings is long-standing. 

Well, that sort of thing comes up in the test suites, so we at least
need to adjust.
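The breakage Haoyu describes can be reproduced directly in Python 3. A minimal sketch, using a plain Python stand-in for the wrapped function (the real one would be a C++ function exposed through Boost.Python):

```python
# Stand-in for a Boost.Python-wrapped function whose char const* return
# value is converted to bytes under Python 3 (the current converter behavior).
def hello():
    return b"Hello"

# Under Python 2, "Hello" == hello() held; under Python 3, str and bytes
# never compare equal, so existing test-suite code like this fails.
print("Hello" == hello())   # False
print(b"Hello" == hello())  # True
```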

> Also, we need to fix booleans not working quite properly in BPL.

Good one.

>> There are more issues similar to this. I'll figure out more and write
>> a detailed proposal as soon as possible.
>
> I have my own opinion on unicode and judging by the other posts, I'll 
> be disagreeing with just about everyone else.
>
> Firstly, I'd like to state that Python v3 has ditched the old string 
> system for very good reasons and this change more than any other has 
created source incompatibilities in most code. One cannot expect much 
> difference in Boost.Python - code *should* need to be explicitly 
> ported.

I'd rather say, "we shouldn't bend over backwards to maintain 100%
backward compatibility because it's impossible."  It should still be a
goal to avoid forcing the authors of bindings to make changes where it's
reasonable to do so.

> Much like the open() function with text (and reusing its machinery), 
> I propose you need to specify the *default* encoding for immutable 
> const char * though it defaults from LC_LANG in most cases to UTF-8 - 
> that's right, const char * will be UTF-8 by default though it's 
> overridable. 

Are you suggesting that when converting to Python string, char const*
should be interpreted by Boost.Python as UTF-8?  I think that's
pretty reasonable.  Any bytes found in a C++ string that fall outside of
7-bit ASCII aren't portable anyway, are they?
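In Python terms, the two candidate conversions differ in whether the C++ bytes are decoded on the way into the interpreter. A rough sketch of both behaviors (the names are illustrative, not Boost.Python API):

```python
# The bytes a UTF-8 source file would embed in a char const* literal "grüß".
raw = b"gr\xc3\xbc\xc3\x9f"

as_bytes = raw                  # current py3k behavior: a bytes object
as_text = raw.decode("utf-8")   # proposed default: interpret as UTF-8 text

print(as_text)       # grüß
print(len(raw))      # 6 (bytes)
print(len(as_text))  # 4 (characters)
```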

> This default encoding should be a per-python interpreter 
> setting (i.e. it uses whatever open() uses) though it can be 
> temporarily overridden using a specifier template.

Now I'm a little confused about what you mean.

> const unsigned char * looks better to me for immutable byte data

Yes, if it's really just a bag of bytes and not text, I agree.

>  - I agree that some compilers have the option for char * == unsigned
> char *, but these are rare and it's an unwise option in most cases.

No, they're always different types.  The option that some compilers
support is that char can be an unsigned or a signed type.  That's a big
difference, and it means Haoyu doesn't need to worry about it.

> std::vector<unsigned char> is definitely the fellow for mutable byte 
> data. std::vector<char> smells too much like a mutable string.

I don't see what practical impact that should have on this project.

> I appreciate that const char * defaulting to UTF-8 might be 
> controversial - after all, ISO C++ has remained very agnostic about 
> the matter (see
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2442.htm
> for a proposed slight change). I have rationales though:
>
> 1. While unicode in C++ source anywhere other than in string literals
> is a very bad idea, both MSVC and GCC support having the entire source
> file in UTF-8 text and unicode characters in string literals being
> incorporated as-is. This method of incorporating non-European letters
> into string literals is too easy to avoid using ;). It's certainly
> very common in Asia and until the u8 literal specifier is added to C++
> there isn't an easy workaround. It wouldn't be very open minded of us
> to assume the world can still easily make do with 7-bit ASCII.

Makes sense.
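The point about UTF-8 source files can be illustrated with a round trip: the encoded form below is what a compiler accepting UTF-8 source would embed byte-for-byte in a char const* literal (the example text is my own, not from the thread):

```python
# A non-ASCII string literal, as it would appear in a UTF-8 source file.
greeting = "こんにちは"              # "hello" in Japanese
encoded = greeting.encode("utf-8")   # the bytes the compiler would store

print(len(greeting))  # 5 characters
print(len(encoded))   # 15 bytes (3 per character here)
print(encoded.decode("utf-8") == greeting)  # True: lossless round trip
```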

> 2. A *lot* of library code e.g. GUI library code, most of Linux or 
> indeed the GNU C library, has moved to UTF-8 for its string literals. 
> Interfacing strings between BPL and this library code is made much 
> easier and natural if it takes UTF-8 as default.

ditto

> 3. "char *" in C and C++ means "a string" in the minds of most 
> programmers. 

char const* does.  What to do about char* is a separate question.

And don't forget char const (&)[N]

> Having it be a set of bytes might be standards correct 
> but let's be clear, C style strings have always had escape sequences 
> so they have never been entirely pure byte sequences. 

I don't know what that means.  Escape sequences disappear between the
source file and the in-memory representation of the string.
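That escape sequences vanish at compile time is easy to check: the in-memory string holds only the resulting character, indistinguishable from one typed literally.

```python
# An escape sequence is source-level notation, not data: after parsing,
# only the denoted character remains.
print("\x48\x69" == "Hi")   # True: \x48 is 'H', \x69 is 'i'
print(len("\n"))            # 1: one character, not a backslash plus 'n'
```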

> Making this immutable bytes instead will make BPL unnatural to use -
> expect questions here on c++-sig :)

Agreed.

> 4. Chances are that world + dog is going to move to UTF-8 eventually 
> anyway and that means all C++ source code. Might as well make that 
> the least typing required scenario.
>
> Anyway, I expect few will agree with me, but that's my opinion.

I think you raised a number of issues whose relevance I can't see, but I
agree with the substance of your argument.

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
