Validate string as UTF-8?

Diez B. Roggisch deets at nospam.web.de
Sun Nov 6 15:10:55 EST 2005


Tony Nelson wrote:
> I'd like to have a fast way to validate large amounts of string data as 
> being UTF-8.
> 
> I don't see a fast way to do it in Python, though:
> 
>     unicode(s,'utf-8').encode('utf-8')
> 
> seems to notice at least some of the time (the unicode() part works but 
> the encode() part bombs).  I don't consider a RE based solution to be 
> fast.  GLib provides a routine to do this, and I am using GTK so it's 
> included in there somewhere, but I don't see a way to call GLib 
> routines.  I don't want to write another extension module.

I somehow doubt that the encode bombs. Can you provide some more 
details? Maybe some of the strings that allegedly don't work?

Besides that, the encode() step is unnecessary - unicode(s, "utf-8") 
alone should be sufficient. If there are any undecodable byte sequences 
in there, the decode should fail on them with a UnicodeDecodeError.
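
A minimal sketch of that decode-only check, assuming the input is a byte 
string. The helper name is_valid_utf8 is made up for illustration; 
data.decode("utf-8") does the same job as unicode(data, "utf-8") and also 
works on later Python versions where unicode() no longer exists.

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8.

    Hypothetical helper: it relies only on the built-in strict decoder,
    so no re-encoding step is needed.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on the first
        # byte sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The decoded result is thrown away here, so for very large inputs it may be 
worth validating in chunks to keep memory use down.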

Regards,

Diez


