[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

Maciej Fijalkowski fijall at gmail.com
Mon Mar 7 03:46:23 EST 2016


Hi hubo.

I think you're slightly confusing two things.

UTF-16 is a variable-length encoding: it has two-word (surrogate-pair)
characters, and len() *has to* return 1 for those. UCS-2 seems closer to
what you described (a fixed-width encoding), but it can't encode all the
unicode characters and as such is unsuitable for a modern unicode
representation.
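
For example (a quick sketch in plain Python 3, where unicode strings are
sequences of code points):

    # U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair
    # (two 16-bit units) for it, yet len() must still report 1.
    s = u"\U0001F600"
    assert len(s) == 1                       # one code point
    assert len(s.encode("utf-16-le")) == 4   # two 16-bit units
    assert len(s.encode("utf-8")) == 4       # four bytes in UTF-8 as well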

So I'll discard UCS-2 as unsuitable; and were we to use UTF-16, the
slicing and size calculations would still have to be as complicated as
for UTF-8.
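
To make that concrete: even just finding the byte position of the n-th
code point in a UTF-8 buffer needs a linear scan, along these lines
(naive, untested sketch):

    def byte_index(utf8_buf, n):
        # Byte offset of the n-th code point: a scan over the buffer,
        # instead of the simple `n * width` arithmetic that a fixed-width
        # encoding would allow.
        data = bytearray(utf8_buf)
        i = 0
        for _ in range(n):
            i += 1                                         # past the lead byte
            while i < len(data) and 0x80 <= data[i] <= 0xBF:
                i += 1                                     # skip continuation bytes
        return i

    buf = u"a\xe9\u6f22a".encode("utf-8")
    assert buf[byte_index(buf, 2):byte_index(buf, 3)] == u"\u6f22".encode("utf-8")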

As for the complicated logic in repr(): that is not usually a
performance-critical part of your program, and it's OK to have some
complications there.

It's true that UTF-16 can be less memory efficient than UTF-8 for certain
languages; however, both are more memory efficient than what we currently
use (UCS4). There are some complications, though: even if you work
exclusively in, say, Korean, a web server still has to deal with parts
that are ASCII (HTML markup, CSS etc.) while handling the Korean text. In
those cases UTF-8 vs. UTF-16 is more muddled and the exact outcome depends
a lot on the workload. We also need to consider the fact that we ship one
canonical PyPy to everybody - people using different languages and
different encodings.
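
As a rough illustration (plain Python 3, with a made-up snippet of markup
mixed with Korean text; the exact numbers depend entirely on the input):

    text = u'<p class="greeting">안녕하세요, 세계</p>'
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):   # utf-32-le ~ UCS4
        print(enc, len(text.encode(enc)), "bytes for", len(text), "code points")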

Overall, UTF-8 definitely seems like a better alternative than UCS4 (also
for Asian languages), which is what we are using now, and I would be
inclined to keep UTF-16 as an option to see whether it performs better on
certain benchmarks.

Best regards,
Maciej Fijalkowski

On Mon, Mar 7, 2016 at 9:58 AM, hubo <hubo at jiedaibao.com> wrote:
> I think it is not reasonable to use UTF-8 to represent the unicode string
> type.
>
>
> 1. Less storage - this is not always true. It is only true for strings with
> a lot of ASCII characters. In Asia, most strings in the local languages
> (Japanese, Chinese, Korean) consist of non-ASCII characters, so they may
> consume more storage than in UTF-16. To make things worse, while an
> N-character string always consumes 2*N bytes in UTF-16, it is difficult to
> estimate the size of an N-character string in UTF-8 (it may be anywhere from
> N bytes to 3*N bytes). (UTF-16 also has two-word characters, but len()
> reports 2 for these characters; I think it is not harmful to treat them as
> two characters.)
>
> 2. There would be very complicated logic for size calculation and slicing.
> In UTF-16 every character is represented with a 16-bit integer, so size
> calculation and slicing are convenient. But a character in UTF-8 takes a
> variable number of bytes, so we either call the mb_* string functions
> instead (which are slow by nature) or use special logic such as storing the
> indices of characters in another array (which introduces the cost of extra
> addressing).
>
> 3. When displayed with repr(), non-ASCII characters are shown in the \uXXXX
> format. If the internal storage for unicode is UTF-8, the only way to be
> compatible with this format is to convert it back to UTF-16.
>
> It may be wiser to let programmers decide which encoding they would like to
> use. If they want to process UTF-8 strings without the cost of converting,
> they should use "bytes". When correct size calculation and slicing of
> non-ASCII characters matter, it may be better to use "unicode".
>
> 2016-03-07
> ________________________________
> hubo
> ________________________________
>
> From: Armin Rigo <arigo at tunes.org>
> Sent: 2016-03-05 16:09
> Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
> To: "Piotr Jurkiewicz" <piotr.jerzy.jurkiewicz at gmail.com>
> Cc: "PyPy Developer Mailing List" <pypy-dev at python.org>
>
> Hi Piotr,
>
> Thanks for giving some serious thoughts to the utf8-stored unicode
> string proposal!
>
> On 5 March 2016 at 01:48, Piotr Jurkiewicz
> <piotr.jerzy.jurkiewicz at gmail.com> wrote:
>>     Random access would be as follows:
>>
>>         page_num, byte_in_page = divmod(codepoint_pos, 64)
>>         page_start_byte = index[page_num]
>>         exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>>         return buffer[exact_byte]
>
> This is the part I'm least sure about: seek_forward() needs to be a
> loop over 0 to 63 codepoints.  True, each loop can be branchless, and
> very short---let's say 4 instructions.  But it still makes a total of
> up to 252 instructions (plus the checks to know if we must go on).
> These instructions are all or almost all dependent on the previous
> one: you must have finished computing the length of one sequence to
> even begin computing the length of the next one.  Maybe it's faster to
> use a more "XMM-izable" algorithm which counts 0 for each byte in
> 0x80-0xBF and 1 otherwise, and makes the sum.
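
A scalar sketch of that counting idea (untested; `page` here is assumed to
be a bytes object holding one 64-codepoint page of the buffer):

    def count_codepoints(page):
        # 0 for each continuation byte (0x80-0xBF), 1 for everything else;
        # the total is the number of code points stored in the page.
        return sum(0 if 0x80 <= b <= 0xBF else 1 for b in bytearray(page))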
>
> There are also variants, e.g. adding a second array of words similar
> to 'index', but where each word is 8 packed bytes giving 8 starting
> points inside the page (each in range 0-252).  This would reduce the
> walk to 0-7 codepoints.
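
Roughly, the lookup with that second array would be (hypothetical names;
the two index arrays are assumed to be built when the string is created):

    def locate(buffer, index, sub_index, codepoint_pos):
        # index[p]     -> byte offset at which page p (64 codepoints) starts
        # sub_index[p] -> 8 packed byte offsets, one per group of 8 codepoints
        data = bytearray(buffer)
        page, cp_in_page = divmod(codepoint_pos, 64)
        group, cp_in_group = divmod(cp_in_page, 8)
        i = index[page] + sub_index[page][group]
        for _ in range(cp_in_group):        # walk at most 7 codepoints
            i += 1                          # past the lead byte
            while 0x80 <= data[i] <= 0xBF:  # skip continuation bytes
                i += 1
        return i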
>
> I'm +1 on your proposal. The whole thing is definitely worth a try.
>
>
> A bientôt,
>
> Armin.
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev
>


More information about the pypy-dev mailing list