Chardet, file, ... and the Flexible String Representation

Fri Sep 6 12:59:08 EDT 2013

On Fri, Sep 6, 2013, at 11:46, Piet van Oostrum wrote:
> The FSR does not split Unicode into chunks. It does not create problems
> and therefore it doesn't have to solve this. 
> 
> The FSR simply stores a Unicode string as an array[*] of ints (the
> Unicode code points of the characters of the string). That's it. Then it
> uses a memory-efficient way to store this array of ints. But that has
> nothing to do with character sets. The same principle could be used for
> any array of ints.

I think the source of the confusion is that the FSR is described in terms
of UCS-2 and Latin-1, which people often think of (especially Latin-1) as
different encodings rather than as merely storing code points in a
narrower integer type.
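
To make the "narrower type" point concrete, here is a rough sketch that
measures how many bytes each code point costs. It assumes CPython 3.3 or
later (sys.getsizeof is implementation-specific), and the helper name
bytes_per_code_point is mine, not part of any API. Only the storage width
changes with the largest code point in the string; the code points
themselves are the same ints either way.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate per-code-point storage by comparing two string lengths.

    Subtracting the sizes cancels the fixed object header, leaving only
    the cost of the n extra code points. CPython 3.3+ assumed.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

for ch in ("a", "\xe9", "\u0100", "\U00010000"):
    # ord() gives the same kind of int regardless of the width chosen.
    print(hex(ord(ch)), bytes_per_code_point(ch))
```

The width is chosen per string from its largest code point, so this says
nothing about character sets or encodings, only about how wide the array
elements need to be.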

----

Incidentally, how does all this interact with ctypes unicode buffers,
which slice as strings and must be UTF-16 on Windows? This was fine
pre-FSR when unicode objects were UTF-16, but I'm not sure how it would
work now.
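
A small sketch for poking at this, assuming Python 3.3 or later. It does
not answer the question; it only shows what the buffer looks like from the
Python side. The helper name describe_buffer is made up for illustration.
I have deliberately kept the sample text inside the BMP, since what
happens to characters above U+FFFF depends on sizeof(wchar_t) on the
platform (2 on Windows, typically 4 elsewhere).

```python
import ctypes

def describe_buffer(text, size):
    """Create a wchar_t buffer holding text and report how it looks.

    text is assumed to contain only BMP characters, so that
    len(text) equals the number of wchar_t units on every platform.
    """
    buf = ctypes.create_unicode_buffer(text, size)
    return {
        "wchar_bytes": ctypes.sizeof(ctypes.c_wchar),  # platform-dependent
        "elements": len(buf),                # number of wchar_t slots
        "total_bytes": ctypes.sizeof(buf),   # elements * wchar_bytes
        "value": buf.value,                  # str up to the first NUL
        "slice": buf[:len(text)],            # slicing yields a str
    }

print(describe_buffer("abc", 8))
```

The buffer is an array of c_wchar, so the conversion to and from str
happens at the ctypes boundary rather than relying on how the str object
stores its code points internally.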