[Python-Dev] Codecs and StreamCodecs

Thu, 18 Nov 1999 11:59:06 -0500

[Responding to some lingering mails]

[/F]
> >     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
> >     >>> s = u.encode("html-entities")
> >     >>> d = decoder("html-entities")
> >     >>> d.decode(s[:-1])
> >     "å i åa ä e "
> >     >>> d.flush()
> >     "ö"

[MAL]
> Ah, ok. So the .flush() method checks for proper
> string endings and then either returns the remaining
> input or raises an error.

No, please.  See my previous post on flush().

> > input: read chunks of data, decode, and
> > keep extra data in a local buffer.
> > 
> > output: encode data into suitable chunks,
> > and write to the output stream (that's why
> > there's a buffersize argument to encode --
> > if someone writes a 10mb unicode string to
> > an encoded stream, python shouldn't allocate
> > an extra 10-30 megabytes just to be able to
> > encode the darn thing...)
> 
> So the stream codecs would be wrappers around the
> string codecs.

No -- the other way around.  Think of the stream encoder as a little
FSM engine that you feed with unicode characters and which sends bytes
to the backend stream.  When a unicode character comes in that
requires a particular shift state, and the FSM isn't in that shift
state, it emits the escape sequence to enter that shift state first.
It should use standard buffered writes to the output stream; i.e. one
call to feed the encoder could cause several calls to write() on the
output stream, or vice versa (if you fed the encoder a single
character it might keep it in its own buffer).  That's all up to the
codec implementation.

The flush() call forces the FSM into the "neutral" shift state,
possibly writing an escape sequence to leave the current shift state,
and empties the internal buffer.
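
To make the FSM picture concrete, here is a minimal sketch in present-day
Python.  Everything in it is an assumption made for illustration: the
two-state encoding and its escape sequences are invented, and the names
ToyStreamEncoder, write() and buffersize are hypothetical rather than
part of any proposed API.

```python
import io

# Hypothetical escape sequences for an invented two-state encoding.
ESC_ASCII = b"\x1b(A"   # enter the neutral (one byte per character) state
ESC_WIDE = b"\x1b(W"    # enter the two-bytes-per-character state


class ToyStreamEncoder:
    """Toy shift-state encoder that writes bytes to a backend stream."""

    def __init__(self, stream, buffersize=16):
        self.stream = stream
        self.buffersize = buffersize
        self.state = "ascii"        # current shift state of the FSM
        self.buffer = bytearray()   # encoded bytes not yet written out

    def write(self, text):
        for ch in text:
            code = ord(ch)
            if code > 0xFFFF:
                raise ValueError("toy encoding only covers U+0000..U+FFFF")
            needed = "ascii" if code < 128 else "wide"
            if needed != self.state:
                # Emit the escape sequence that enters the required state.
                self.buffer += ESC_ASCII if needed == "ascii" else ESC_WIDE
                self.state = needed
            if needed == "ascii":
                self.buffer.append(code)
            else:
                self.buffer += code.to_bytes(2, "big")
            if len(self.buffer) >= self.buffersize:
                self._drain()

    def _drain(self):
        # Ordinary buffered write to the backend stream.
        if self.buffer:
            self.stream.write(bytes(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Return to the neutral state, then empty the internal buffer.
        if self.state != "ascii":
            self.buffer += ESC_ASCII
            self.state = "ascii"
        self._drain()
```

With the default buffersize, feeding a short string writes nothing to
the backend until flush() is called; with a small buffersize, a single
write() call can trigger several writes on the backend stream.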

The string codec CONCEPTUALLY runs the stream codec against a cStringIO
object, using flush() to force the final output.  However the
implementation may take a shortcut.  For stateless encodings the
stream codec may call on the string codec, but that's all an
implementation issue.
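
As a rough illustration of that conceptual relationship, the sketch
below uses names from today's standard library, which postdate this
thread: io.BytesIO stands in for cStringIO, and an incremental encoder
called with final=True plays the role of flush().  The helper name
encode_string is hypothetical.

```python
import codecs
import io


def encode_string(text, encoding):
    """Encode a whole string by driving a stateful encoder into a buffer."""
    backend = io.BytesIO()
    encoder = codecs.getincrementalencoder(encoding)()
    for ch in text:
        # Feed one character at a time; the encoder tracks its shift state.
        backend.write(encoder.encode(ch))
    # Final call with no input: forces any trailing escape sequence out.
    backend.write(encoder.encode("", final=True))
    return backend.getvalue()


if __name__ == "__main__":
    sample = "abc \u3042"
    # For a stateful encoding, the result should match the one-shot codec.
    print(encode_string(sample, "iso2022_jp") == sample.encode("iso2022_jp"))
```

A real string codec would normally take the shortcut and encode the
whole string in one step; the loop is only there to show the
equivalence.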

For input, things are slightly different: you don't know how much
encoded data you must read to give you N Unicode characters, so you
may have to make a guess and hold on to some data that you read
unnecessarily -- either in encoded form or in Unicode form, at the
discretion of the implementation.  Using seek() on the input stream is
forbidden (it could be a pipe or socket).
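
A minimal sketch of the input side, again in present-day Python and
with hypothetical names (BufferedDecodingReader, chunksize).  It only
ever calls read() on the underlying stream, never seek().  Incomplete
byte sequences are held in encoded form inside the incremental decoder,
and surplus decoded characters are held in Unicode form in pending.

```python
import codecs
import io


class BufferedDecodingReader:
    """Return exactly n characters per read() when enough data is available."""

    def __init__(self, stream, encoding, chunksize=4):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunksize = chunksize   # a guess at how many bytes to read
        self.pending = ""            # decoded characters read ahead
        self.eof = False

    def read(self, n):
        while len(self.pending) < n and not self.eof:
            chunk = self.stream.read(self.chunksize)
            if not chunk:
                self.eof = True
                # final=True raises an error on a truncated trailing sequence.
                self.pending += self.decoder.decode(b"", final=True)
            else:
                # The decoder keeps any incomplete trailing bytes internally.
                self.pending += self.decoder.decode(chunk)
        result, self.pending = self.pending[:n], self.pending[n:]
        return result


if __name__ == "__main__":
    data = "\u00e5\u00e4\u00f6 abc".encode("utf-8")
    reader = BufferedDecodingReader(io.BytesIO(data), "utf-8", chunksize=3)
    print(repr(reader.read(1)), repr(reader.read(2)), repr(reader.read(10)))
```

With chunksize=3 the first chunk ends in the middle of a two-byte
UTF-8 sequence, so the reader has to carry one byte over to the next
read without going back to the stream.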

--Guido van Rossum (home page: http://www.python.org/~guido/)