Validate string as UTF-8?

Tony Nelson *firstname*nlsnews at georgea*lastname*.com
Sun Nov 6 15:47:39 EST 2005


In article <mailman.176.1131307306.18701.python-list at python.org>,
 "Fredrik Lundh" <fredrik at pythonware.com> wrote:

> Tony Nelson wrote:
> 
> > I'd like to have a fast way to validate large amounts of string data as
> > being UTF-8.
> 
> define "validate".

By "validate" I mean checking that all the data conforms to the UTF-8
encoding format.  I can live with data that someone has crafted to pass
as UTF-8 but that isn't really Unicode.


> > I don't see a fast way to do it in Python, though:
> >
> >     unicode(s,'utf-8').encode('utf-8')
> 
> if "validate" means "make sure the byte stream doesn't use invalid
> sequences", a plain
> 
>     unicode(s, "utf-8")
> 
> should be sufficient.

You are correct.  I misunderstood what was happening in my code.  I 
apologise for wasting bandwidth and your time (and I wasted my own time 
as well).

Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough 
for my purpose, adding about 25% to the time to load a file.
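
For the archives, here is a minimal sketch of that check wrapped in a
function.  The name is_valid_utf8 is made up for illustration, and I am
assuming that s.decode('utf-8') performs the same decoding as
unicode(s, 'utf-8'), just spelled as a method on the byte string:

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8.

    `is_valid_utf8` is a hypothetical helper name, not a library function.
    """
    try:
        # Assumed equivalent of unicode(data, 'utf-8'); the decoded
        # result is thrown away, only success or failure matters here.
        data.decode('utf-8')
    except UnicodeDecodeError:
        # Raised on the first byte sequence that is not valid UTF-8.
        return False
    return True
```

Calling is_valid_utf8(b'caf\xc3\xa9') should give True, while a stray
byte such as b'\xff' or a truncated sequence such as b'\xc3' should give
False.  If you want to know where the bad byte is, catch the exception
yourself and look at its start attribute instead of discarding it.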
________________________________________________________________________
TonyN.:'                        *firstname*nlsnews at georgea*lastname*.com
      '                                  <http://www.georgeanelson.com/>


