[I18n-sig] UTF-8 decoder in CVS still buggy

M.-A. Lemburg mal@lemburg.com
Mon, 24 Jul 2000 10:26:25 +0200


Walter Underwood wrote:
> 
> I'd rather that it not try to "repair" broken UTF-8. If it isn't UTF-8,
> throw an exception, and let the caller decide.

Note that we are talking about the "replace" error handling
case here. The default "strict" mode will throw an exception.
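
A minimal sketch of the difference between the two modes, written in
present-day Python 3 syntax (an assumption; the CVS code under discussion
predates it). The sample bytes are made up for illustration.

```python
# 0xFF can never occur in well-formed UTF-8, so this input is invalid.
data = b"abc\xff"

# Default "strict" mode: the decoder raises and reports where it failed.
try:
    data.decode("utf-8")  # same as errors="strict"
    offset = None
except UnicodeDecodeError as exc:
    offset = exc.start  # byte offset of the first offending byte
    print("strict mode failed at byte offset", offset)

# "replace" mode: the invalid byte becomes U+FFFD and decoding continues.
text = data.decode("utf-8", errors="replace")
print(repr(text))
```

The caller opts in to "replace" explicitly; nothing is repaired unless
that mode is requested.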
 
> For example, when parsing XML, invalid UTF-8 means the whole document is
> invalid. It is considered polite to say where the first invalid character
> occurs, but it is not acceptable to continue parsing. An XML parser cannot
> use a UTF-8 decoder that accepts invalid UTF-8.
> 
> Code that deals with multiple encodings usually needs to do some encoding
> guessing up front, before choosing an encoder. If the guess is wrong, I'd
> want the decoder to fail, so we can try the next most likely encoding.
> 
> We're busy converting our search engine to use Unicode, so I'm really
> familiar with the issues right now.
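
The "try the next most likely encoding" approach described above can be
sketched roughly as follows, again in present-day Python 3 syntax. The
helper name decode_with_fallback and the candidate list are hypothetical,
chosen only for illustration; the sketch relies on strict decoding
failing when a guess is wrong.

```python
def decode_with_fallback(data, candidates):
    """Try each candidate encoding in order, using strict decoding.

    Hypothetical helper: returns (encoding_name, text) for the first
    candidate that decodes without error.
    """
    for name in candidates:
        try:
            # Strict mode (the default) raises on invalid input, which
            # is what lets us move on to the next guess.
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# A lone 0xE9 byte is a truncated sequence in UTF-8 but valid Latin-1.
used, text = decode_with_fallback(b"caf\xe9", ["utf-8", "latin-1"])
print(used, repr(text))
```

Note that a single-byte encoding such as Latin-1 accepts any byte string,
so it only makes sense as the last candidate.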

Please keep us informed of any quirks you may experience
during this conversion. We can use some real-life reports for
the new Unicode support in Python to polish up the implementation
and design.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/