[Python-Dev] Decoding incomplete unicode

M.-A. Lemburg mal at egenix.com
Wed Aug 25 10:41:58 CEST 2004


Walter Dörwald wrote:
> M.-A. Lemburg wrote:
> 
>> Martin v. Löwis wrote:
>>
>>> Walter Dörwald wrote:
>>>
>>>> OK, let's come up with a patch that fixes the incomplete byte
>>>> sequences problem and then discuss non-stream APIs.
>>>>
>>>> So, what should the next step be?
>>>
>>>
>>> I think your first patch should be taken as a basis for that.
>>
>>
>> We do need a way to communicate state between the codec
>> and Python.
>>
>> However, I don't like the way that the patch
>> implements this state handling: I think we should use a
>> generic "state" object here which is passed to the stateful
>> codec and returned together with the standard return values
>> on output:
>>
>> def decode_stateful(data, state=None):
>>     ... decode and modify state ...
>>     return (decoded_data, length_consumed, state)
> 
> Another option might be that the decode function changes
> the state object in place.

Good idea.
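
To make the idea concrete, here is a rough sketch of what such an
interface could look like for UTF-8. It is written for present-day
Python 3, not for the 2.4 code base under discussion. The DecoderState
class and its "pending" attribute are names made up for this
illustration, and codecs.utf_8_decode() with final=False is used only
as a convenient way to stop before an incomplete trailing sequence.
The state object is changed in place and also returned, so both
calling styles work:

```python
import codecs


class DecoderState:
    """Hypothetical per-codec state: bytes of an incomplete trailing sequence."""

    def __init__(self):
        self.pending = b""


def decode_stateful(data, state=None):
    """Decode UTF-8 bytes, carrying incomplete sequences over in `state`."""
    if state is None:
        state = DecoderState()
    buf = state.pending + data
    # final=False makes the decoder stop before an incomplete trailing
    # sequence instead of raising an error.
    text, used = codecs.utf_8_decode(buf, "strict", False)
    state.pending = buf[used:]  # the state object is modified in place
    # All of `data` counts as consumed: whatever was not decoded now
    # lives in `state` rather than on the input side.
    return text, len(data), state


# Usage: a three-byte character split across two calls.
state = DecoderState()
euro = "\u20ac".encode("utf-8")
first, _, _ = decode_stateful(b"a" + euro[:2], state)
second, _, _ = decode_stateful(euro[2:], state)
```

Because the caller only ever hands the object back, the codec is free
to pick whatever representation suits it.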

>> where the object type and contents of the state variable
>> are defined per codec (e.g. could be a tuple, just a single
>> integer or some other special object).
> 
> If a tuple is passed and returned this makes it possible
> from Python code to mangle the state. IMHO this should be
> avoided if possible.
> 
>> Otherwise we'll end up having different interface
>> signatures for all codecs and extending them to accommodate
>> for future enhancement will become unfeasible without
>> introducing yet another set of APIs.
> 
> We already have slightly different decoding functions:
> utf_16_ex_decode() takes additional arguments.

Right - it was a step in the wrong direction. Let's not
go further down that path in the future.

>> Let's discuss this some more and implement it for Python 2.5.
>> For Python 2.4, I think we can get away with what we already
>> have:
> 
>> [...]
> 
> OK, I've updated the patch.
> 
>> [...]
>> The buffer logic should only be used for streams
>> that do not support the interface to push back already
>> read bytes (e.g. .unread()).
>>
>> From a design perspective, keeping read data inside the
>> codec is the wrong thing to do, simply because it leaves
>> the input stream in an undefined state in case of an error
>> and there's no way to correlate the stream's read position
>> to the location of the error.
>>
>> With a pushback method on the stream, all the stream
>> data will be stored on the stream, not the codec, so
>> the above would no longer be a problem.
> 
> On the other hand this requires a special stream. Data
> already read is part of the codec state, so why not
> put it into the codec?

Ideally, the codec should not store data, but only
reference it. It's better to keep things well
separated, which is why I think we need the .unread()
interface and, eventually, a queue interface to support
the feeding operation.
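
As a sketch of the pushback idea (again in present-day Python 3, and
only under assumptions): PushbackStream and read_text() below are
hypothetical names, not existing APIs, and .unread() is simply modeled
as "put these bytes back in front of the stream". The point is that
undecoded bytes go back to the stream, so the decoding side keeps no
data of its own:

```python
import codecs
import io


class PushbackStream:
    """Hypothetical wrapper that adds .unread() to a binary stream."""

    def __init__(self, stream):
        self._stream = stream
        self._pushed = b""

    def read(self, size=-1):
        if size < 0:
            data = self._pushed + self._stream.read()
            self._pushed = b""
            return data
        # Serve pushed-back bytes first, then fall through to the stream.
        data = self._pushed[:size]
        self._pushed = self._pushed[size:]
        if len(data) < size:
            data += self._stream.read(size - len(data))
        return data

    def unread(self, data):
        # Bytes handed back are returned first by the next read().
        self._pushed = data + self._pushed


def read_text(stream, size):
    """Read up to `size` bytes, decode what is complete, unread the rest."""
    data = stream.read(size)
    text, used = codecs.utf_8_decode(data, "strict", False)
    stream.unread(data[used:])
    return text


# Usage: the first read ends in the middle of a three-byte character.
stream = PushbackStream(io.BytesIO("a\u20acb".encode("utf-8")))
parts = [read_text(stream, 2), read_text(stream, 3), read_text(stream, 3)]
```

Note that a caller has to ask for enough bytes to complete a pending
sequence, otherwise the same bytes are read and pushed back again.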

>> However, we can always add the .unread() support to the
>> stream codecs at a later stage, so it's probably ok
>> to default to the buffer logic for Python 2.4.
> 
> OK.
> 
>>> That still leaves the issue
>>> of the last read operation, which I'm tempted to leave unresolved
>>> for Python 2.4. No matter what the solution is, it would likely
>>> require changes to all codecs, which is not good.
>>
>>
>> We could have a method on the codec which checks whether
>> the codec buffer or the stream still has pending data
>> left. Using this method is an application scope consideration,
>> not a codec issue.
> 
> But this means that the normal error handling can't be used
> for those trailing bytes.

Right, but then: missing data (which usually causes the trailing
bytes) is really something for the application to deal with,
e.g. by requesting more data from the user, another application
or trying to work around the problem in some way. I don't think
that the codec error handler can practically cover these
possibilities.
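
A small sketch of what such an application-level check could look
like. It uses the incremental decoder interface of present-day
Python 3 (codecs.getincrementaldecoder() and the decoder's getstate()
method), which is not the API discussed in this thread; decode_chunks()
is a hypothetical helper. The codec just decodes, and the application
looks at the leftover bytes and decides what to do with them:

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks; return (text, bytes still pending in the decoder)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # getstate() returns (buffered_input, additional_state); the first
    # item holds the undecoded trailing bytes.
    pending, _ = decoder.getstate()
    return "".join(parts), pending


# Usage: the input stops one byte short of a complete character.
text, pending = decode_chunks([b"a\xe2", b"\x82"])
if pending:
    # Application decision: ask for more data, report truncation, etc.
    action = "request more data"
else:
    action = "done"
```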

-- 
Marc-Andre Lemburg
eGenix.com

