[Python-3000] string C API

Nick Coghlan ncoghlan at gmail.com
Fri Sep 15 17:15:27 CEST 2006


Jim Jewett wrote:
>> > ISTM that raising the exception lazily (which seems to be necessary)
>> > would be very confusing.
> 
>> Yeah, it appears it would be necessary to at least *scan* the string when
>> it was first created in order to ensure it can be decoded without errors
>> later on.
> 
> What happens today with strings?  I think the answer is:
>     "Nothing.
>      They print something odd when printed.
>      They may raise errors when explicitly recoded to unicode."
> Why is this a problem?

We don't have 8-bit strings lying around in Py3k. To convert bytes to 
characters, they *must* be converted to unicode code points.

> I'm not so happy about the efficiency implication of the idea that
> *all* strings *must* be validated (let alone recoded).

Then always define latin-1 as the source encoding for your files - it will 
just pass the bytes straight through.
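
Roughly, the reason that works (sketching it with illustrative Py3k-ish
spellings, nothing here is a promise about the final API): every byte value
0-255 maps straight to the code point with the same number, so the decode
step can never fail and re-encoding recovers the original bytes exactly.

    # latin-1 maps each byte 0..255 to the code point with the same value,
    # so decoding needs no validation and round-trips the bytes exactly
    raw = bytes(range(256))
    text = raw.decode('latin-1')           # always succeeds
    assert [ord(c) for c in text] == list(range(256))
    assert text.encode('latin-1') == raw   # the original bytes come back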

>> Since strings don't currently have any mutable internal state, it's
>> possible to freely share them between threads (without this property,
>> the interning behaviour would be doomed).
> 
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings.  It
> might make sense to intern only strings that are in the same encoding
> as the source code.  (Or whose values are limited to ASCII?)

Unicode strings don't have an encoding - they only store code points.
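
To illustrate (again with indicative spellings only): the same sequence of
code points can be serialised to quite different byte sequences, and those
bytes only come into existence once you pick a codec.

    s = 'caf\xe9'                     # four code points, no encoding attached
    print([hex(ord(c)) for c in s])   # ['0x63', '0x61', '0x66', '0xe9']
    print(s.encode('utf-8'))          # b'caf\xc3\xa9' - U+00E9 takes two bytes
    print(s.encode('latin-1'))        # b'caf\xe9'     - U+00E9 takes one byte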

>> If strings could change the encoding of their internal buffers then
>> they'd have to use a read/write lock internally on all operations that
>> might be affected when the internal representation changes. Blech.
> 
> Why?
> 
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.
> 
> Anything keeping its own reference to the old databuffer (and old
> encoding) will continue to work, so immutability ==> the two buffers
> really are equivalent.

I admit that by using a separate Python object for the data buffer instead of 
a pointer to raw memory, the incref/decref in the processing code becomes the 
moral equivalent of a read lock, but consider the case where Thread A performs 
an operation and decides "I need to recode the buffer to UCS-4" at the same 
time that Thread B performs an operation and decides "I need to recode the 
buffer to UCS-4".

To deal with that you would still want to be very careful with the incref 
new/reassign/decref old step for switching in a new data buffer (probably by 
using some form of atomic reassignment operation).
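
To make that concrete, here is a rough Python analogue of the swap step (the
class and attribute names are purely illustrative, not a proposed API), with
a single attribute rebind standing in for the atomic reassignment:

    class LazyString:
        def __init__(self, data, codec):
            # the buffer and its codec travel together, so one rebind
            # swaps both and old holders still see a consistent pair
            self._state = (data, codec)

        def _recode_to_ucs4(self):
            data, codec = self._state
            new_data = data.decode(codec).encode('utf-32-le')
            # fully construct the replacement, then rebind in one step
            self._state = (new_data, 'utf-32-le')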

And this style has some very serious overhead implications, as each string 
would now require:
   The string object, with a 32- or 64-bit pointer to the data buffer object
   The data buffer object

String memory overhead would double, with an additional 32 or 64 bits 
depending on the platform. This is a pretty significant increase when it 
comes to identifier-length strings.

So still blech, even if you make the data buffer a separate Python object to 
avoid the need for an actual read/write lock.

>> Sure certain applications that are just copying from one data stream to
>> another (both in the same encoding) may needlessly decode and then
>> re-encode the data,
> 
> Other than text editors, "certain" includes almost any application I
> have ever used, let alone written.

If you're reading text and you *know* it is ASCII data, then you can just
set the encoding to latin-1, since that can simply copy the original bytes
into the string's internal buffer. The actual ascii codec has to check each
byte to see whether or not the high bit is set, so it would be slower, and it
would blow up with a decoding error if the high bit was ever set.
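
The difference is easy to see (illustrative spellings again):

    data = b'caf\xe9'                  # one byte with the high bit set
    print(data.decode('latin-1'))      # succeeds - the byte maps to U+00E9
    try:
        data.decode('ascii')           # has to validate every byte
    except UnicodeDecodeError as exc:
        print(exc)                     # complains about the 0xe9 byte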

I suspect an awful lot of quick-and-dirty scripts written by native English 
speakers will do exactly that.

>> but if the application *knows* that this might happen (and has
>> reason to care about optimising the performance of this case), then the
>> application is free to decouple the "reading" and "decoding" steps, and
>> just transfer raw bytes between the streams.
> 
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

No, the standard recipe becomes "handle bytes as bytes and text as 
characters". If you know your source data is 8-bit text (or are happy to treat 
it that way, even if it isn't), then use the latin-1 codec to decode the 
original bytes directly to 8-bit characters.

Or just open the file in binary and read the data in as bytes instead of 
characters.
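
Side by side, the two recipes look something like this (file names made up):

    # text recipe: decode on the way in, work with characters throughout
    with open('notes.txt', encoding='latin-1') as f:
        text = f.read()              # a string of code points

    # bytes recipe: skip decoding entirely and hand the bytes straight through
    with open('notes.txt', 'rb') as src, open('copy.txt', 'wb') as dst:
        dst.write(src.read())        # no decode/encode round trip at all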

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

