[I18n-sig] UTF-8 decoder in CVS still buggy

Walter Underwood wunder@ultraseek.com
Sun, 23 Jul 2000 16:28:56 -0700


--On Sunday, July 23, 2000 10:40 PM +0200 Florian Weimer <fw@deneb.enyo.de> 
wrote:
>
> And your search engine stops processing a document as soon as it
> encounters an invalid UTF-8 sequence even though the majority of it is
> valid UTF-8?  I don't think so.

Actually, since such a document is likely to have errors and not be readable
in an application, tossing it could be the best choice. Showing people hits
that they can't read is not very polite.

But we do try harder than that: the engine falls back to a different
character set. Eventually it ends up in a very liberal character set, like
windows-1252, where almost all 8-bit values are legal.
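
Roughly, the fallback looks like this (a minimal sketch in Python; the
candidate list and the detection logic in the real product are more involved,
and the function name here is made up for illustration):

    def decode_with_fallback(data):
        # Illustrative candidate order: strict charsets first, liberal last.
        for charset in ('utf-8', 'windows-1252'):
            try:
                return data.decode(charset), charset
            except UnicodeDecodeError:
                continue
        # latin-1 maps all 256 byte values, so this last step cannot fail.
        return data.decode('latin-1'), 'latin-1'

The order matters: the liberal charsets have to come last, because they will
happily accept byte sequences that a stricter charset would reject.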

We do a similar thing with XML: if a document fails the XML parse, we try it
as HTML, and our HTML parser will take almost anything.
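
A sketch of that fallback, using Python's standard-library parsers as
stand-ins for ours (the class and function names are invented for the
example):

    import xml.etree.ElementTree as ET
    from html.parser import HTMLParser

    class TextCollector(HTMLParser):
        # Tolerant HTML pass: just collect the character data.
        def __init__(self):
            super().__init__()
            self.parts = []
        def handle_data(self, data):
            self.parts.append(data)

    def extract_text(text):
        try:
            # Strict XML parse first.
            return ''.join(ET.fromstring(text).itertext())
        except ET.ParseError:
            # Fall back to HTML, which swallows almost anything.
            collector = TextCollector()
            collector.feed(text)
            return ''.join(collector.parts)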

But back to the subject: I'm not sure that repairing invalid UTF-8 is a good
idea. The HTML experience is that it is a really bad idea to accept invalid
documents. If repair is necessary, we might want to call it something other
than a decoder.
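
To put the distinction in Python terms (a sketch; 'replace' is just one way
to spell "repair", and what the CVS decoder actually does is the question
under discussion, not shown here):

    data = b'abc \xff def'              # 0xFF can never occur in UTF-8

    try:
        data.decode('utf-8')            # a decoder: rejects invalid input
    except UnicodeDecodeError:
        pass                            # strict mode raised, as it should

    repaired = data.decode('utf-8', 'replace')   # a repairer, not a decoder
    assert repaired == 'abc \ufffd def'          # U+FFFD replacement char

The first behavior deserves the name "decoder"; the second is something else,
and naming it honestly would keep people from relying on it by accident.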

wunder
--
Walter Underwood
Senior Staff Engineer, Ultraseek Server, Inktomi Corp.
http://www.ultraseek.com/
http://www.inktomi.com/