[Python-Dev] Decoding incomplete unicode

M.-A. Lemburg mal at egenix.com
Tue Aug 24 11:36:45 CEST 2004


Martin v. Löwis wrote:
> Walter Dörwald wrote:
> 
>> OK, let's come up with a patch that fixes the incomplete byte
>> sequences problem and then discuss non-stream APIs.
>>
>> So, what should the next step be?
> 
> I think your first patch should be taken as a basis for that.

We do need a way to communicate state between the codec
and Python.

However, I don't like the way that the patch
implements this state handling: I think we should use a
generic "state" object here which is passed to the stateful
codec and returned together with the standard return values
on output:

def decode_stateful(data, state=None):
    # ... decode data and modify state ...
    return (decoded_data, length_consumed, state)

where the object type and contents of the state variable
are defined per codec (e.g. a tuple, a single integer,
or some other special object).

Otherwise we'll end up having different interface
signatures for all codecs, and extending them to accommodate
future enhancements will become infeasible without
introducing yet another set of APIs.
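
To make the proposed calling convention concrete, here is a minimal
sketch in current Python 3. It is not the interface from the patch: it
borrows the incremental decoder's getstate()/setstate() pair from the
standard codecs module to stand in for the opaque per-codec state, and
the "encoding" parameter and the choice to report len(data) as the
consumed length are assumptions made only for this illustration.

```python
import codecs


def decode_stateful(data, state=None, encoding="utf-8"):
    """Illustrative only: decode `data`, threading an opaque state through.

    `state` is whatever the codec's incremental decoder returns from
    getstate(); callers treat it as opaque and just pass it back in.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    if state is not None:
        decoder.setstate(state)
    decoded_data = decoder.decode(data)
    # Assumption of this sketch: any incomplete trailing bytes travel in
    # the state object, so all of `data` counts as consumed.
    length_consumed = len(data)
    return decoded_data, length_consumed, decoder.getstate()


# Feed a two-byte UTF-8 sequence one byte at a time.
raw = "\u00e9".encode("utf-8")
text1, used1, state1 = decode_stateful(raw[:1])
text2, used2, state2 = decode_stateful(raw[1:], state1)
```

The caller never inspects the state; it only hands it back on the next
call, which is what keeps the signature identical across codecs.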

Let's discuss this some more and implement it for Python 2.5.
For Python 2.4, I think we can get away with what we already
have:

If we leave out the UTF-7 codec changes in the
patch, the only state that the UTF-8 and UTF-16
codecs create is the number of bytes consumed. We already
have the required state parameter for this in the
standard decode API, so no extra APIs are needed for
these two codecs.

So the patch boils down to adding a few new C APIs
and using the consumed parameter in the standard
_codecs module APIs instead of just defaulting to the
input size (we don't need any new APIs in _codecs).
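
For illustration, this is how a consumed count smaller than the input
size looks from Python today. The sketch uses codecs.utf_8_decode and
codecs.utf_16_le_decode as they exist in current Python 3, where a
third "final" argument controls whether an incomplete trailing sequence
is an error; whether the 2.4 patch ends up with exactly this shape is
not something the text above settles.

```python
import codecs

# "a" followed by a Euro sign with its last UTF-8 byte cut off.
data = "a\u20ac".encode("utf-8")[:-1]

# final=False: stop before the incomplete sequence and report how many
# bytes were actually decoded.
text, consumed = codecs.utf_8_decode(data, "strict", False)
leftover = data[consumed:]  # bytes the caller must feed again later

# The same idea for UTF-16: an odd trailing byte is left unconsumed.
text16, consumed16 = codecs.utf_16_le_decode(b"a\x00b", "strict", False)
```

The caller keeps "leftover" and prepends it to the next chunk, so the
only state exchanged is the number of bytes consumed.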

> Add the state-supporting decoding C functions, and change
> the stream readers to use them.

The buffer logic should only be used for streams
that do not support an interface for pushing back bytes
that have already been read (e.g. .unread()).

From a design perspective, keeping read data inside the
codec is the wrong thing to do, simply because it leaves
the input stream in an undefined state in case of an error
and there's no way to correlate the stream's read position
to the location of the error.

With a pushback method on the stream, all the stream
data will be stored on the stream, not the codec, so
the above would no longer be a problem.

However, we can always add the .unread() support to the
stream codecs at a later stage, so it's probably ok
to default to the buffer logic for Python 2.4.
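
A rough sketch of the pushback idea, in current Python 3. The
PushbackStream class and the read_text() helper are hypothetical names
invented for this example; no stream in the standard library is assumed
to offer .unread(). The point is only that undecoded bytes go back to
the stream rather than staying inside the codec.

```python
import codecs
import io


class PushbackStream:
    """Hypothetical wrapper that adds .unread() to a binary stream."""

    def __init__(self, stream):
        self._stream = stream
        self._pushed = b""

    def read(self, size=-1):
        if size < 0:
            data, self._pushed = self._pushed + self._stream.read(), b""
            return data
        # Serve pushed-back bytes first, then fall through to the stream.
        data, self._pushed = self._pushed[:size], self._pushed[size:]
        if len(data) < size:
            data += self._stream.read(size - len(data))
        return data

    def unread(self, data):
        # Pushed-back bytes are returned by the next read().
        self._pushed = data + self._pushed


def read_text(stream, size):
    """Read up to `size` bytes and decode only the complete UTF-8 part."""
    raw = stream.read(size)
    text, consumed = codecs.utf_8_decode(raw, "strict", False)
    stream.unread(raw[consumed:])  # incomplete tail stays on the stream
    return text


stream = PushbackStream(io.BytesIO("h\u20acllo".encode("utf-8")))
first = read_text(stream, 2)    # the read splits the Euro sign
second = read_text(stream, 16)  # the pushed-back byte is read again
```

Because the codec holds nothing between calls, the stream's position
always corresponds to the end of the text decoded so far.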

> That still leaves the issue
> of the last read operation, which I'm tempted to leave unresolved
> for Python 2.4. No matter what the solution is, it would likely
> require changes to all codecs, which is not good.

We could have a method on the codec which checks whether
the codec buffer or the stream still has pending data
left. Whether to use this method is an application-level decision,
not a codec issue.
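
As a sketch of what such a check could look like on the codec side,
the helper below asks an incremental decoder from current Python 3
whether it is still holding undecoded bytes. The name has_pending() is
made up for this example, and checking the stream side would need a
separate, stream-specific test that is not shown.

```python
import codecs


def has_pending(decoder):
    """Hypothetical helper: True if the decoder still buffers input bytes."""
    buffered, _flags = decoder.getstate()
    return bool(buffered)


decoder = codecs.getincrementaldecoder("utf-8")()
decoder.decode(b"\xe2\x82")          # first two bytes of a Euro sign
pending_before = has_pending(decoder)
last = decoder.decode(b"\xac")       # final byte completes the character
pending_after = has_pending(decoder)
```

An application that cares about truncated input can call the check
after the last read and decide for itself whether leftover bytes are an
error.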

-- 
Marc-Andre Lemburg
eGenix.com