[Python-Dev] UTF-8 Decoder

Antoine Pitrou solipsis at pitrou.net
Tue Apr 14 16:42:31 CEST 2009


Jeroen Ruigrok van der Werven <asmodai <at> in-nomine.org> writes:
> 
> This got posted on the Unicode list. Does it seem interesting for Python
> itself? The UTF-8 to UTF-16 transcoding might be.
> 
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

If you have some time on your hands, you could try benchmarking it against
Python 3.1's (py3k) decoder. There are two cases to consider:
- mostly non-ASCII input, such as the "utf-8 demo" file mentioned on the page
  above
- mostly ASCII input, which is very common in practice (think HTML, XML, log
  files, etc.)

The py3k utf-8 decoder is optimized for the latter.
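
For the Python side of such a comparison, a rough timing harness could look
like the sketch below. It only measures the built-in bytes.decode("utf-8") on
the two kinds of input described above; it does not include the DFA decoder
from the linked page. The sample strings, the function names (make_samples,
bench_decode, run) and the sizes are made up for illustration, and the code
assumes a current Python 3 rather than 3.1.

```python
import timeit


def make_samples(size=100_000):
    """Build two UTF-8 byte strings of at least `size` bytes each.

    The sample lines are invented for illustration; a real benchmark
    should use real documents (HTML pages, log files, the demo file).
    """
    ascii_unit = "<p>caf\u00e9 log line 42: status=ok</p>\n"
    mixed_unit = "Ελληνικά, 日本語, русский\n"
    samples = {}
    for label, unit in (("mostly ASCII", ascii_unit),
                        ("mostly non-ASCII", mixed_unit)):
        unit_bytes = unit.encode("utf-8")
        # Repeat the line often enough to reach the requested size.
        samples[label] = unit_bytes * (size // len(unit_bytes) + 1)
    return samples


def bench_decode(data, repeat=5, number=20):
    """Return the best observed time in seconds for one decode call."""
    timer = timeit.Timer(lambda: data.decode("utf-8"))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def run(size=100_000):
    """Return decode throughput in MB/s for each sample."""
    results = {}
    for label, data in make_samples(size).items():
        seconds = bench_decode(data)
        results[label] = len(data) / seconds / 1e6
    return results


if __name__ == "__main__":
    for label, mb_per_s in run().items():
        print(f"{label:>18}: {mb_per_s:8.1f} MB/s")
```

Taking the minimum over several repeats reduces noise from other processes.
The absolute numbers depend heavily on the machine and the build, so only the
ratio between the two cases is worth comparing.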

Regards

Antoine.



