Most pythonic way to truncate unicode?

John Machin sjmachin at lexicon.net
Fri May 29 00:09:53 EDT 2009


John Machin <sjmachin <at> lexicon.net> writes:

> Andrew Fong <FongAndrew <at> gmail.com> writes:

> > Are there any built-in ways to do something like this already? Or do I
> > just have to iterate over the unicode string?
> 
> Converting each character to utf8 and checking the
> total number of bytes so far?
> Ooooh, sloooowwwwww!
> 

Somewhat faster:

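# Total the UTF-8 encoded length of unicode_string without encoding it.
u8len = 0
for u in unicode_string:
   if u <= u'\u007f':      # ASCII: 1 byte
      u8len += 1
   elif u <= u'\u07ff':    # 2 bytes
      u8len += 2
   elif u <= u'\uffff':    # rest of the BMP: 3 bytes
      u8len += 3
   else:                   # beyond the BMP: 4 bytes
      u8len += 4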
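
The loop above only totals the encoded length. A minimal sketch of how the same per-character test might drive the truncation itself follows. It assumes Python 3, where each element of a str is a full code point; the names utf8_len and truncate_utf8 are hypothetical and not from the original thread.

```python
def utf8_len(ch):
    """Return the number of bytes one code point needs in UTF-8."""
    cp = ord(ch)
    if cp <= 0x7F:
        return 1
    elif cp <= 0x7FF:
        return 2
    elif cp <= 0xFFFF:
        return 3
    return 4


def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text that fits in max_bytes of UTF-8.

    Hypothetical helper: stops before the first character that would
    push the running byte total over the limit, so no character is split.
    """
    total = 0
    for i, ch in enumerate(text):
        total += utf8_len(ch)
        if total > max_bytes:
            return text[:i]
    return text
```

For example, truncate_utf8(u'h\u00e9llo', 2) keeps only u'h', because the two-byte e-acute would bring the total to 3 bytes.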

Cheers,
John



