Most pythonic way to truncate unicode?

Thu May 28 11:58:35 EDT 2009

Andrew Fong wrote:

> I need to ...
> 
> 1) Truncate long unicode (UTF-8) strings based on their length in
> BYTES. For example, u'\u4000\u4001\u4002 abc' has a length of 7 but
> takes up 13 bytes. Since u'\u4000' takes up 3 bytes, I want truncate
> (u'\u4000\u4001\u4002 abc',3) == u'\u4000' -- as compared to
> u'\u4000\u4001\u4002 abc'[:3] == u'\u4000\u4001\u4002'.
> 
> 2) I don't want to accidentally chop any unicode characters in half.
> If the byte truncate length would normally cut a unicode character in
> 2, then I just want to drop the whole character, not leave an orphaned
> byte. So truncate(u'\u4000\u4001\u4002 abc',4) == u'\u4000' ... as
> opposed to getting UnicodeDecodeError.
> 
> I'm using Python2.6, so I have access to things like bytearray. Are
> there any built-in ways to do something like this already? Or do I
> just have to iterate over the unicode string?

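How about encoding to UTF-8, slicing the bytes, and decoding with the
"ignore" error handler, which drops an incomplete trailing sequence: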

>>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")
u'\xe4\xf6'
>>> print _
äö
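
The one-liner above could be wrapped in a small helper that matches the
truncate() calls in the question. This is only a sketch: the name
truncate_utf8 and its max_bytes parameter are made up for illustration,
and the check for a negative limit is an added assumption, since a
negative slice bound would count from the end of the byte string.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 form fits in max_bytes.

    Hypothetical helper, not a built-in; assumes text is a unicode string.
    """
    if max_bytes < 0:
        # Assumption: a negative limit is a caller error, not "count from the end".
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    # The slice may end in the middle of a multi-byte sequence; the "ignore"
    # error handler drops that incomplete tail instead of raising.
    return encoded[:max_bytes].decode("utf-8", "ignore")


# The two cases from the question: 3 and 4 bytes both yield one character.
assert truncate_utf8(u"\u4000\u4001\u4002 abc", 3) == u"\u4000"
assert truncate_utf8(u"\u4000\u4001\u4002 abc", 4) == u"\u4000"
```

Because the bytes come from encoding a valid unicode string, the only
thing "ignore" can ever discard here is the chopped-off sequence at the
end, so no characters before the cut are lost.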

Peter