[Python-Dev] Generalised String Coercion

"Martin v. Löwis" martin at v.loewis.de
Sun Aug 7 15:06:27 CEST 2005


Reinhold Birkenfeld wrote:
> FWIW, I've already drafted a patch for the former. It lets you write to
> file.encoding and honors this when writing Unicode strings to it.

I don't like that approach. You shouldn't be allowed to change the
encoding mid-stream (except perhaps under very specific circumstances).

As I see it, the buffer of an encoded file becomes split, at least for
input: there are bytes which have been read and not yet decoded, and
there are characters which have been decoded but not yet consumed.
If you change the encoding mid-stream, you would have to undo decoding
that was already done, resetting the stream to the real "current"
position.
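
A minimal sketch of that split, assuming a current Python 3 and the
incremental decoder interface in its codecs module (an illustration, not
the patch under discussion). The names data, decoded, pending_bytes and
rest are made up for the example:

```python
import codecs

data = "n\u00e9e".encode("utf-8")   # b'n\xc3\xa9e'; U+00E9 takes two bytes
decoder = codecs.getincrementaldecoder("utf-8")()

# Pretend the file layer has read only the first two bytes:
# 'n' plus the first half of the two-byte sequence.
decoded = decoder.decode(data[:2])

# Bytes that were read from the file but could not be decoded yet.
pending_bytes, _flags = decoder.getstate()

# Feeding the remaining bytes completes the pending character.
rest = decoder.decode(data[2:], final=True)

print(repr(decoded), repr(pending_bytes), repr(rest))
```

Changing the codec after the first decode() call would mean handing
pending_bytes (and any characters the application has not consumed yet)
back to the byte stream before the new codec starts.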

For output, the situation is similar: before changing to a new encoding,
or before changing from unicode output to byte output, you have to
flush the codec first: it may be that the codec has buffered some
state which needs to be completely processed first before a new codec
can be applied to the stream.
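
To illustrate why the flush matters, here is a small sketch that assumes a
current Python 3 and uses ISO-2022-JP as an example of a stateful codec
(the choice of codec and the variable names are assumptions for the
example). The encoder switches character sets with an escape sequence and
only switches back when told the input is final:

```python
import codecs

encoder = codecs.getincrementalencoder("iso2022_jp")()

# Encoding a kana character leaves the encoder in a non-ASCII shift state.
body = encoder.encode("\u3042")

# Flushing with final=True makes the codec emit whatever is needed to
# return the stream to its initial state.
tail = encoder.encode("", final=True)

# Only body + tail together form a complete, self-contained encoding.
complete = body + tail
print(repr(body), repr(tail))
```

If a different codec (or raw bytes) were written after body without the
flush, a reader would still be interpreting the following bytes in the old
shift state.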

Another issue is seeking: given the many different kinds of buffers,
seeking becomes fairly complex. Ideally, seeking should apply to
application-level positions, i.e. when you tell the current position,
it should be in terms of data already consumed by the application.
Perhaps seeking in an encoded stream should not be supported at all.
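
A sketch of the difference between the two kinds of position, assuming a
current Python 3 and its io.TextIOWrapper (which postdates this message;
its tell() value is documented as opaque, so the example only checks that
it round-trips). The variable names are invented for the example:

```python
import io

data = "\u00e9abc".encode("utf-8")        # 5 bytes, 4 characters
raw = io.BytesIO(data)
f = io.TextIOWrapper(raw, encoding="utf-8")

first = f.read(1)          # the application has consumed one character
raw_pos = raw.tell()       # the byte layer has typically read far ahead

app_pos = f.tell()         # position in terms of consumed data
remainder = f.read()

f.seek(app_pos)            # seeking back to the application-level position
again = f.read()

print(raw_pos, repr(first), repr(remainder), repr(again))
```

Here raw_pos reflects what was pulled into the buffers, while app_pos is
the position that is meaningful to the application.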

Finally, you also have to consider Universal Newlines: you can apply
them either on the byte stream, or on the character stream. I think the
conceptually right approach would be to do universal newlines on the
character stream.
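
A small sketch of why the layer matters, assuming a current Python 3 and
using UTF-16-LE as an example encoding in which a newline is not a single
byte. The helper universal_newlines is hypothetical and written only for
this illustration:

```python
text = "one\r\ntwo\rthree\n"
raw = text.encode("utf-16-le")

def universal_newlines(chars):
    """Translate \\r\\n and lone \\r to \\n on decoded characters."""
    return chars.replace("\r\n", "\n").replace("\r", "\n")

# Translation on the character stream: decode first, then translate.
char_level = universal_newlines(raw.decode("utf-16-le"))

# Naive translation on the byte stream: the byte pair b"\r\n" never occurs,
# because every character is followed by a zero byte, so "\r\n" is turned
# into two newlines instead of one.
byte_level = (
    raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n").decode("utf-16-le")
)

print(repr(char_level))
print(repr(byte_level))
```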

Regards,
Martin

