[Python-Dev] Codecs and StreamCodecs

Fredrik Lundh fredrik@pythonware.com
Wed, 17 Nov 1999 12:00:10 +0100

M.-A. Lemburg <mal@lemburg.com> wrote:
> >     def flush(self):
> >         # flush the decoding buffers.  this should usually
> >         # return None, unless knowing that the input stream
> >         # has ended means that the state can be interpreted
> >         # in a meaningful way.  however, if the state
> >         # indicates that the last character was not
> >         # finished, this method should raise a UnicodeError
> >         # exception.
> Could you explain the reason for having a .flush() method
> and what it should return?

in most cases, it should either return None, or
raise a UnicodeError exception:

    >>> u = unicode("å i åa ä e ö", "iso-latin-1")
    >>> # yes, that's a valid Swedish sentence ;-)
    >>> s = u.encode("utf-8")
    >>> d = decoder("utf-8")
    >>> d.decode(s[:-1])
    "å i åa ä e "
    >>> d.flush()
    UnicodeError: last character not complete
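
for what it's worth, the session above can be approximated with a
small sketch.  the Decoder class below is a hypothetical stand-in for
the decoder(...) object, not the proposed implementation; it assumes
Python 3 and leans on the standard library's
codecs.getincrementaldecoder, where passing final=True plays the role
of flush():

```python
import codecs


class Decoder:
    """Hypothetical stand-in for the decoder(...) object in the example."""

    def __init__(self, encoding):
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def decode(self, data):
        # Returns whatever can be decoded so far; an incomplete trailing
        # byte sequence stays in the decoder's internal buffer.
        return self._decoder.decode(data)

    def flush(self):
        # final=True tells the decoder that no more input will arrive.
        # Leftover bytes that do not form a full character raise
        # UnicodeDecodeError, which is a subclass of UnicodeError.
        return self._decoder.decode(b"", final=True) or None


u = "å i åa ä e ö"
s = u.encode("utf-8")

d = Decoder("utf-8")
head = d.decode(s[:-1])  # "ö" takes two bytes in UTF-8; the second is missing

error = None
try:
    d.flush()
except UnicodeError as exc:
    error = exc
```

here head holds everything up to the truncated character, and error
holds the exception raised by flush().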

on the other hand, there are situations where it
might actually return a string.  consider an "HTML
entity decoder" which uses the following pattern
to match a character entity: "&\w+;?" (note that
the trailing semicolon is optional).

    >>> u = unicode("å i åa ä e ö", "iso-latin-1")
    >>> s = u.encode("html-entities")
    >>> d = decoder("html-entities")
    >>> d.decode(s[:-1])
    "å i åa ä e "
    >>> d.flush()
    "ö"
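
a minimal sketch of such an entity decoder, assuming Python 3.  the
names EntityDecoder and encode_entities are made up for this example
(there is no "html-entities" codec in the standard library); only the
"&\w+;?" pattern comes from the text above.  the decoder holds back a
trailing "&name" because more letters or a semicolon might still
arrive, and flush() resolves it once the input is known to have ended:

```python
import re
from html.entities import codepoint2name, name2codepoint

ENTITY = re.compile(r"&(\w+);?")       # trailing semicolon is optional
UNFINISHED = re.compile(r"&\w*\Z")     # "&name" at the very end of the data


def _replace(match):
    name = match.group(1)
    if name in name2codepoint:
        return chr(name2codepoint[name])
    return match.group(0)              # unknown name: leave text untouched


def encode_entities(text):
    # Hypothetical encoder: named entities for non-ASCII characters only.
    return "".join(
        "&%s;" % codepoint2name[ord(c)]
        if ord(c) > 127 and ord(c) in codepoint2name else c
        for c in text
    )


class EntityDecoder:
    def __init__(self):
        self._pending = ""

    def decode(self, data):
        text = self._pending + data
        self._pending = ""
        match = UNFINISHED.search(text)
        if match:
            # Keep the possibly unfinished entity for the next call.
            text, self._pending = text[:match.start()], match.group()
        return ENTITY.sub(_replace, text)

    def flush(self):
        pending, self._pending = self._pending, ""
        if not pending:
            return None
        # End of input: "&ouml" without the semicolon is now complete.
        return ENTITY.sub(_replace, pending)


s = encode_entities("å i åa ä e ö")
d = EntityDecoder()
head = d.decode(s[:-1])   # drops the final ";"
tail = d.flush()
```

with the final semicolon cut off, decode() returns everything before
"&ouml" and flush() returns the last character.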

> Perhaps I'm missing something, but how would you define
> stream codecs using this interface ?

input: read chunks of data, decode, and
keep extra data in a local buffer.
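
the input side might look roughly like this.  read_decoded and
chunksize are names invented for this sketch; it assumes Python 3,
and the "local buffer" is the one the standard library's incremental
decoder keeps internally:

```python
import codecs
import io


def read_decoded(stream, encoding="utf-8", chunksize=4096):
    """Yield decoded text from a binary stream, one chunk at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        # Bytes of a character split across two chunks stay buffered
        # inside the decoder until the rest arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # End of stream: raises UnicodeError if a character was left unfinished.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


data = "å i åa ä e ö".encode("utf-8")
# A tiny chunksize forces multi-byte characters to straddle chunk boundaries.
result = "".join(read_decoded(io.BytesIO(data), chunksize=3))
```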

output: encode data into suitable chunks,
and write to the output stream (that's why
there's a buffersize argument to encode --
if someone writes a 10mb unicode string to
an encoded stream, python shouldn't allocate
an extra 10-30 megabytes just to be able to
encode the darn thing...)
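
a rough sketch of the output side, assuming Python 3.  write_encoded
is a hypothetical helper, and buffersize here counts characters per
chunk, which may or may not be what the proposed encode argument
means.  the point is only that memory use is bounded by the chunk
size rather than by the size of the whole string:

```python
import codecs
import io


def write_encoded(stream, text, encoding="utf-8", buffersize=8192):
    """Encode text in slices of buffersize characters and write each one."""
    encoder = codecs.getincrementalencoder(encoding)()
    for start in range(0, len(text), buffersize):
        # Only one slice is encoded at a time, so the extra allocation
        # is proportional to buffersize, not to len(text).
        stream.write(encoder.encode(text[start:start + buffersize]))
    # Let stateful encoders emit any trailing bytes.
    tail = encoder.encode("", final=True)
    if tail:
        stream.write(tail)


text = "å i åa ä e ö" * 1000
out = io.BytesIO()
write_encoded(out, text, buffersize=7)
```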

> > Implementing stream codecs is left as an exercise (see the zlib
> > material in the eff-bot guide for a decoder example).

everybody should have a copy of the eff-bot guide ;-)

(but alright, I plan to post a complete utf-8 implementation
in the not too distant future).
