[Python-Dev] bytes type discussion

"Martin v. Löwis" martin at v.loewis.de
Wed Feb 15 09:14:37 CET 2006


Greg Ewing wrote:
> If the protocol has been sensibly designed, that shouldn't
> happen, since everything up to the coding marker should
> be ascii (or some other protocol-defined initial coding).

XML, for one protocol, requires you to start over. The
initial sequence could be UTF-16, or it could be EBCDIC.
You read a few bytes (up to four), then know which of
these it is. Then you start over, reading further if
it looks like an ASCII superset, to find out the real
encoding. You normally then start over, although switching
at that point could also work.
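
The two passes could be sketched roughly as below. This is only an
illustration under assumptions: the names sniff_family and
detect_xml_encoding are made up here, cp037 stands in for "some EBCDIC
code page", UCS-4 is ignored, and the declaration is matched with a
simplified regular expression rather than a real parser.

```python
import re

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
DECL = re.compile(
    rb"""<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_family(head):
    """First pass: classify the stream from its first (up to four) bytes."""
    if head[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "utf-16"          # byte order mark present
    if head[:4] == b"\x00<\x00?":
        return "utf-16-be"       # "<?" in big-endian UTF-16, no BOM
    if head[:4] == b"<\x00?\x00":
        return "utf-16-le"       # "<?" in little-endian UTF-16, no BOM
    if head[:4] == b"\x4c\x6f\xa7\x94":
        return "cp037"           # "<?xm" in EBCDIC (code page is an assumption)
    if head[:3] == b"\xef\xbb\xbf":
        return "utf-8-sig"       # UTF-8 with a byte order mark
    return "ascii-superset"

def detect_xml_encoding(data):
    """Second pass: start over and read the declaration if it is ASCII-like."""
    family = sniff_family(data[:4])
    if family != "ascii-superset":
        return family
    match = DECL.match(data)
    # Without a declared encoding, fall back to UTF-8.
    return match.group(1).decode("ascii") if match else "utf-8"
```

With the result in hand, the caller would start over a final time and
decode the whole input with the detected encoding.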

> For protocols that are not sensibly designed (or if you're
> just trying to guess) what you suggest may be needed. But
> it would be good to have a nicer way of going about it
> for when the protocol is sensible.

There might be buffering of decoded strings already
(i.e. beyond the point to which you have read), so
you would need to unbuffer these, and reinterpret
them. To support that, you really need to buffer
both the original bytes, and the decoded ones, since
the encoding might not roundtrip.
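
A minimal sketch of such a double buffer follows. The class name
RewindableDecoder and its methods are hypothetical, and for simplicity
it reinterprets everything received so far rather than only the part
beyond the read position. The point is that switching goes back to the
stored bytes and never re-encodes the decoded text.

```python
import codecs

class RewindableDecoder:
    """Keep the raw bytes next to the decoded text so both stay available."""

    def __init__(self, encoding):
        self.raw = bytearray()   # every byte received so far
        self._reset(encoding)

    def _reset(self, encoding):
        # Build a fresh incremental decoder and decode the stored bytes again.
        self.encoding = encoding
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self.text = self._decoder.decode(bytes(self.raw))

    def feed(self, data):
        """Buffer new bytes and extend the decoded text."""
        self.raw += data
        self.text += self._decoder.decode(data)

    def switch(self, encoding):
        """Reinterpret the original bytes, not the decoded text."""
        self._reset(encoding)

buf = RewindableDecoder("latin-1")
buf.feed(b"caf\xc3\xa9")         # decoded with the wrong guess at first
buf.switch("utf-8")              # re-decoded from the raw bytes

# Why the text alone is not enough: this codec does not roundtrip,
# because encoding always adds a byte order mark.
roundtripped = b"a".decode("utf-8-sig").encode("utf-8-sig")
```

Here buf.text ends up as the correctly decoded string, while
roundtripped differs from the input b"a".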

Regards,
Martin

