codecs.getencoder encodes entire string ?

"Martin v. Löwis" martin at v.loewis.de
Thu Jul 28 16:15:58 EDT 2005


nicolas_riesch wrote:

> I just don't understand why it returns the "length consumed".
> 
> Does it mean that in some cases, the input string can be only partially
> converted ?

For an encoder, I believe the answer is "no". For a decoder, it is
a definite yes: if the input does not end with a complete character,
you may have bytes left at the end which did not get decoded.
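
As a rough illustration of the decoder case, here is a minimal sketch
for a modern Python 3 interpreter (an assumption; the behaviour in
older versions may differ). It calls codecs.utf_8_decode with its
third argument (final) set to False, so a trailing partial character
is left unconsumed rather than reported as an error. The sample
string is made up for the example.

```python
import codecs

data = "Löwis".encode("utf-8")   # 6 bytes: the ö takes two of them
truncated = data[:2]             # cut in the middle of the ö

# final=False tells the decoder that more input may still follow,
# so the dangling lead byte is not treated as an error.
text, consumed = codecs.utf_8_decode(truncated, "strict", False)

# Whatever was not consumed is still waiting to be decoded.
leftover = truncated[consumed:]
```

Here the consumed length is smaller than the length of the input,
which is exactly the situation the return value exists to report.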

For an encoder, the same *might* happen if you want to encode
half-surrogates into, say, UTF-8; the encoder might refuse to
encode the half-surrogate, and wait for the other half. Of course,
the current UTF-8 encoder will then just encode the surrogate
codepoint as if it was a proper character.
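
For comparison, a small sketch of what a Python 3 interpreter does
with a lone surrogate (assuming Python 3; this is not the encoder
that was current when this message was written). As far as I know,
the strict encoder refuses the surrogate outright instead of waiting
for its other half, and the "surrogatepass" error handler restores
the encode-it-anyway behaviour.

```python
import codecs

encode = codecs.getencoder("utf-8")
lone = "\ud800"   # a high surrogate with no low surrogate after it

# With the default "strict" handler the encoder raises an error.
try:
    encode(lone)
    refused = False
except UnicodeEncodeError:
    refused = True

# "surrogatepass" encodes the surrogate code point as if it were
# a proper character, and reports one character consumed.
data, consumed = encode(lone, "surrogatepass")
```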

If you extend the notion of "encoding", similar things may happen
all the time. E.g. a DES encoder may only support multiples of
the block size, and leave bytes at the end.

> What can be the use of the "length consumed" value ?

It's primarily intended for stream writers, which may need
to buffer extra characters at the end that did not get encoded,
and wait until more input is provided.
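
The buffering idea is easiest to observe on the decoding side, so
the sketch below uses a decoder rather than a stream writer. It
assumes Python 3, and decode_chunks is a hypothetical helper written
for this example, not part of the codecs module. The consumed length
decides how much of the buffer to drop after each call.

```python
import codecs

def decode_chunks(chunks):
    """Decode UTF-8 byte chunks, carrying undecoded bytes forward."""
    pending = b""
    pieces = []
    for chunk in chunks:
        pending += chunk
        # final=False: an incomplete trailing character is not an error
        text, consumed = codecs.utf_8_decode(pending, "strict", False)
        pieces.append(text)
        # keep only the bytes the decoder did not consume
        pending = pending[consumed:]
    if pending:
        raise ValueError("input ended inside a character")
    return "".join(pieces)

# The two bytes of the ö arrive in different chunks.
result = decode_chunks([b"L\xc3", b"\xb6wis"])
```

In real code, codecs.getincrementaldecoder or a StreamReader does
this bookkeeping for you; the loop only shows where the length
consumed comes in.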

For all practical purposes, you can ignore the length on
encoding. If you are paranoid, assert that it equals the
length of the input.
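
A minimal sketch of that paranoid check, assuming Python 3 and a
made-up sample string. Note that the length consumed counts input
characters, not output bytes.

```python
import codecs

encode = codecs.getencoder("utf-8")

text = "Löwis"
data, consumed = encode(text)

# The encoder should have consumed every input character.
assert consumed == len(text)
```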

Regards,
Martin


