[I18n-sig] XML and codecs

M.-A. Lemburg mal@lemburg.com
Tue, 05 Jun 2001 20:01:52 +0200


"Martin v. Loewis" wrote:
> 
> > How would UTF-16 be handled? I guess without additional
> > code multiple BOMs would be generated for a string that
> > contains unencodable characters.
> 
> When you generate or decode UTF-16, this is not a problem: There won't
> be any unencodable characters.
> 
> Even if that was a problem: Just by raising the exception, there won't
> be multiple BOMs. So you have to provide additional code, anyway, so
> you better make sure this code is correct.
> 
> The problem becomes real for codecs that preserve state: You'll need
> to maintain the state of the codec from the time the exception
> occurred, so that subsequent .encode calls will continue in the shift
> state they were in previously.

That should be no problem: the exception effectively freezes
the current state of the codec (provided it's a StreamWriter/StreamReader),
so you can use that state to take appropriate action.
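
A minimal sketch of what I mean, assuming the codecs API of a current
Python 3 (where codecs.getwriter() returns the StreamWriter class and a
lone surrogate is something the strict UTF-16 encoder refuses). The "?"
fallback and the variable names are only illustrative:

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("utf-16")(buffer)  # errors="strict" by default

writer.write("ab")          # first write emits the BOM, then the data
try:
    writer.write("\ud800")  # lone surrogate: rejected by the strict encoder
except UnicodeEncodeError:
    # Nothing reached the stream for the failed call, and the writer
    # still remembers that the BOM has been sent, so the fallback text
    # goes through the very same writer object.
    writer.write("?")
writer.write("cd")

data = buffer.getvalue()
bom_count = data.count(codecs.BOM_UTF16)  # BOM in native byte order
text = data.decode("utf-16")
```

The point is that the error handling code keeps using the same writer
instead of starting a fresh encoding run, which is what would produce a
second BOM.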
 
> So for codecs that preserve state across .encode calls, codecs.lookup
> will need to return a bound method as encode and decode function, not
> a simple function; see the iconv codec for an example.

Not sure what you mean here, but the encoder and decoder
returned by codecs.lookup() must not maintain state. Keeping
state across calls is reserved for StreamWriters and
StreamReaders (see the Unicode docs).
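
To illustrate the difference, here is a rough sketch. It assumes a
current Python 3 with the iso2022_jp codec available as a convenient
example of an encoding with shift states; the sample characters and
names are arbitrary:

```python
import codecs
import io

info = codecs.lookup("iso2022_jp")

# Stateless encoder: every call is a complete, self-contained run, so
# each chunk carries its own escape sequences and can be decoded alone.
chunk_a, _ = info.encode("\u3042")
chunk_b, _ = info.encode("\u3044")
joined = chunk_a + chunk_b

# StreamWriter: the shift state lives in the writer object and is
# carried over from one write() call to the next.
buffer = io.BytesIO()
writer = info.streamwriter(buffer)
writer.write("\u3042")
writer.write("\u3044")
writer.reset()  # flush the codec back to its initial state
streamed = buffer.getvalue()

escapes_stateless = joined.count(b"\x1b")
escapes_streamed = streamed.count(b"\x1b")
```

Both byte strings decode to the same text, but the streamed version
needs fewer escape sequences because the writer did not have to
re-establish the shift state for the second chunk.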
 
> In some sense, one can argue that the UTF-16 Codec also preserves
> state: whether it has yet emitted a BOM.

BTW, I haven't yet had time to check your utf16 patch but from
a first glance it looks good.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/