Most pythonic way to truncate unicode?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Thu May 28 20:49:37 EDT 2009


On Thu, 28 May 2009 08:50:00 -0700, Andrew Fong wrote:

> I need to ...
> 
> 1) Truncate long unicode (UTF-8) strings based on their length in BYTES.

Out of curiosity, why do you need to do this?


> For example, u'\u4000\u4001\u4002 abc' has a length of 7 but takes up 13
> bytes. 

No, that's not quite right. The number of bytes depends on the encoding;
it's not a property of the unicode string itself.

>>> s = u'\u4000\u4001\u4002 abc'
>>> len(s)  # characters
7
>>> len(s.encode('utf-8'))  # bytes
13
>>> len(s.encode('utf-16'))  # bytes, including a 2-byte BOM
16
>>> len(s.encode('utf-32'))  # bytes, including a 4-byte BOM
32
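
The 16 and 32 above each include a byte-order mark, which the generic
'utf-16' and 'utf-32' codecs prepend. As a rough sketch, here is the same
string measured with the BOM-free little-endian variants as well (the
names codecs_to_try and sizes are just for illustration):

```python
s = u'\u4000\u4001\u4002 abc'

# Encoded length in bytes for several codecs. 'utf-16' and 'utf-32' prepend
# a byte-order mark; the explicit '-le' variants do not.
codecs_to_try = ['utf-8', 'utf-16', 'utf-16-le', 'utf-32', 'utf-32-le']
sizes = dict((name, len(s.encode(name))) for name in codecs_to_try)

for name in codecs_to_try:
    print('%-10s %d bytes' % (name, sizes[name]))
```

The same seven characters come out at five different byte counts.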


> Since u'\u4000' takes up 3 bytes

But it doesn't, at least not in general. The *encoded* unicode character
*may* take up two bytes, or three, or four, or possibly more, depending on
what encoding you use.
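
If UTF-8 really is the target encoding, one possible way to truncate by
bytes is to encode, slice, and decode again while discarding any partial
character left at the cut. This is only a sketch: truncate_utf8 and
max_bytes are made-up names, max_bytes is assumed to be non-negative, and
it takes no special care of combining characters.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Assumes max_bytes is a non-negative integer (hypothetical helper).
    """
    encoded = text.encode('utf-8')[:max_bytes]
    # The slice may end partway through a multi-byte character; 'ignore'
    # drops that incomplete tail instead of raising UnicodeDecodeError.
    return encoded.decode('utf-8', 'ignore')


s = u'\u4000\u4001\u4002 abc'
short = truncate_utf8(s, 7)  # 7 bytes cuts into the third character
print(repr(short), len(short.encode('utf-8')))
```

With a limit of 7 bytes, only the first two characters (6 bytes) survive,
because the third would need bytes 7 through 9.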


-- 
Steven


