[Python-Dev] Codecs and StreamCodecs

M.-A. Lemburg mal@lemburg.com
Fri, 19 Nov 1999 10:56:03 +0100


Guido van Rossum wrote:
> 
> > Like a path of search functions ? Not a bad idea... I will still
> > want the internal dict for caching purposes though. I'm not sure
> > how often these encodings will be used, but even a few hundred function
> > calls will slow down the Unicode implementation quite a bit.
> 
> Of course.  (It's like sys.modules caching the results of an import).

I've fixed the "path of search functions" approach in the latest
version of the spec.
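
To make the idea concrete, here is a minimal sketch of a search
function path backed by a cache dict. It is written for present-day
Python 3, and every name in it (register, lookup, _search_path, _cache)
is an assumption chosen for illustration rather than the interface
defined in the spec:

```python
# Hypothetical sketch: these names are illustrative, not taken from the spec.
_search_path = []   # search functions, tried in registration order
_cache = {}         # normalized encoding name -> codec entry (cf. sys.modules)


def register(search_function):
    """Append a search function to the lookup path."""
    _search_path.append(search_function)


def lookup(encoding):
    """Return the codec entry for encoding, consulting the cache first."""
    key = encoding.lower()
    try:
        return _cache[key]          # fast path: no search functions called
    except KeyError:
        pass
    for search in _search_path:
        entry = search(key)         # a search function returns None on a miss
        if entry is not None:
            _cache[key] = entry     # remember the hit for later lookups
            return entry
    raise LookupError("unknown encoding: %s" % encoding)
```

With this layout the search functions only run on the first lookup of
a given name; every later lookup is a single dict access.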
 
> [...]
> >     def flush(self):
> >
> >       """ Flushes the codec buffers used for keeping state.
> >
> >           Return values are not defined. Implementations are free to
> >           return None, raise an exception (in case there is pending
> >           data in the buffers which could not be decoded) or
> >           return any remaining data from the state buffers used.
> >
> >       """
> 
> I don't know where this came from, but a flush() should work like
> flush() on a file. 

It came from Fredrik's proposal.

> It doesn't return a value, it just sends any
> remaining data to the underlying stream (for output).  For input it
> shouldn't be supported at all.
> 
> The idea is that flush() should do the same to the encoder state that
> close() followed by a reopen() would do.  Well, more or less.  But if
> the process were to be killed right after a flush(), the data written
> to disk should be a complete encoding, and not have a lingering shift
> state.

Ok. I've modified the API as follows:

StreamWriter:
    def flush(self):

        """ Flushes and resets the codec buffers used for keeping state.

            Calling this method should ensure that the data on the
            output is put into a clean state that allows appending
            fresh data without having to rescan the whole stream
            to recover state.

        """
        pass
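
As a rough illustration of that contract, the sketch below uses a
made-up stateful encoding: upper-case runs are written in lower case
between a shift-in and a shift-out byte. The class name, the marker
bytes and the encoding itself are assumptions for the example only and
are not part of the spec:

```python
import io

# Toy shift markers, assumed purely for illustration.
SHIFT_IN, SHIFT_OUT = b"\x0e", b"\x0f"


class ToyStreamWriter:
    """Hypothetical writer for a toy encoding with a shift state."""

    def __init__(self, stream):
        self.stream = stream
        self.shifted = False        # True while inside an upper-case run

    def write(self, text):
        out = bytearray()
        for ch in text:
            upper = ch.isupper()
            if upper and not self.shifted:
                out += SHIFT_IN
                self.shifted = True
            elif not upper and self.shifted:
                out += SHIFT_OUT
                self.shifted = False
            out += ch.lower().encode("ascii")
        self.stream.write(bytes(out))

    def flush(self):
        # Close any open shift state so the output is a complete encoding,
        # then push the data through to the underlying stream.
        if self.shifted:
            self.stream.write(SHIFT_OUT)
            self.shifted = False
        self.stream.flush()
```

After flush() the bytes on the stream end outside any shift state, so
new data can be appended without rescanning what was written before.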

StreamReader:
    def read(self, chunksize=0):

        """ Decodes data from the stream self.stream and returns a tuple
            (Unicode object, bytes consumed).

            chunksize indicates the approximate maximum number of
            bytes to read from the stream for decoding purposes. The
            decoder can modify this setting as appropriate. The default
            value 0 means to read and decode as much as possible.
            The chunksize is intended to prevent having to decode huge
            files in one step.

            The method should use a greedy read strategy, meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given chunksize, e.g.
            if optional encoding endings or state markers are
            available on the stream, these should be read too.

        """
        ... the base class should provide a default implementation
            of this method using self.decode ...

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place.
            This method is primarily intended to recover from
            decoding errors.

        """
        pass

The .reset() method replaces the .flush() method on StreamReaders.
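
For illustration, a default read() built on self.decode could look
roughly like the sketch below. It assumes UTF-8 as the encoding and
present-day Python 3; the class name and the pending attribute are
hypothetical, and keeping an incomplete trailing sequence as state is
one possible design, not something the spec prescribes:

```python
import codecs
import io


class ToyStreamReader:
    """Hypothetical UTF-8 reader following the read()/reset() outline."""

    def __init__(self, stream):
        self.stream = stream
        self.pending = b""          # undecoded tail left by the last read

    def decode(self, data):
        # final=False leaves an incomplete trailing sequence unconsumed
        # instead of raising an error.
        return codecs.utf_8_decode(data, "strict", False)

    def read(self, chunksize=0):
        if chunksize > 0:
            raw = self.stream.read(chunksize)
        else:
            raw = self.stream.read()        # 0 means: read everything
        data = self.pending + raw
        text, consumed = self.decode(data)
        self.pending = data[consumed:]      # keep state for the next call
        # consumed counts bytes decoded in this call, which may include
        # bytes that were read from the stream by an earlier call.
        return text, consumed

    def reset(self):
        # Drop the buffered state; the stream position is left untouched.
        self.pending = b""
```

Reading b"a\xc3\xa9" in chunks of two bytes first yields ("a", 1) with
one byte held back, and the next call completes the character and
yields ("\xe9", 2).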

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/