FSR and unicode compliance - was Re: RE Module Performance

Michael Torrie torriem at gmail.com
Sun Jul 28 11:52:47 EDT 2013


On 07/27/2013 12:21 PM, wxjmfauth at gmail.com wrote:
> Good point. FSR, nice tool for those who wish to teach
> Unicode. It is not every day, one has such an opportunity.

I had composed a long e-mail and decided to chop it down, but it is still
too long.  I also ditched a lot of the context, which jmf seems to do as
well.  Apologies.

1. FSR *is* UTF-32, so it is as Unicode compliant as UTF-32, since UTF-32
is an official encoding.  FSR only differs from UTF-32 in that the
padding zeros are stripped off, so that each string is stored in the most
compact fixed width (1, 2 or 4 bytes per character) that can handle all
the characters in the string, and that width is always known at string
creation time.  Now you can argue many things, but to say FSR is not
Unicode compliant is quite a stretch!  What Unicode entities or
characters cannot be stored in strings using FSR?  What sequences of
bytes in FSR result in invalid Unicode entities?
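
To make point 1 concrete, here is a rough sketch that estimates the
per-character storage by growing a string by one character and comparing
sizes.  It assumes CPython 3.3 or later; sys.getsizeof() reports
implementation details, so other interpreters may show different numbers.
The helper name bytes_per_char is mine, not anything from the standard
library.

```python
import sys

def bytes_per_char(ch):
    """Estimate storage per character by adding one more copy of ch.

    Assumption: CPython 3.3+ string layout. This measures an
    implementation detail, not something the language guarantees.
    """
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)

samples = {
    "ASCII": "a",              # U+0061
    "Latin-1": "\u00e9",       # U+00E9
    "BMP": "\u20ac",           # U+20AC
    "astral": "\U0001F600",    # U+1F600
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(name, width)

# Whatever the internal width, the text round-trips through UTF-32 unchanged.
text = "".join(samples.values())
round_trip = text.encode("utf-32-le").decode("utf-32-le")
print(round_trip == text)
```

On the CPython builds I would expect this to run on, the widths come out
as 1, 1, 2 and 4 bytes, and the round trip through UTF-32 loses nothing.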

2. Strings in Python *never change*.  They are immutable.  The +
operator copies both strings into a new string object, and it would have
to do so even if Python used UTF-8 internally.  If you're doing a lot of
string concatenations, perhaps you're using the wrong data type.  A byte
buffer might be better for you, where you can stuff UTF-8 sequences into
it to your heart's content.
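
A small sketch of point 2, in case it helps.  The sample strings and
variable names are made up for illustration; bytearray and str.join are
the standard tools I have in mind, not the only options.

```python
# Strings cannot be modified in place.
s = "abc"
immutable_error = None
try:
    s[0] = "x"
except TypeError as exc:
    immutable_error = str(exc)

# Concatenation builds a new object; the old one is left alone.
t = s
s += "def"          # s now names a new string, t still names "abc"

# A mutable byte buffer accepts UTF-8 sequences and grows in place.
pieces = ["na\u00efve ", "caf\u00e9 ", "\u20ac5"]
buf = bytearray()
for piece in pieces:
    buf += piece.encode("utf-8")
result = buf.decode("utf-8")

# For plain text, collecting the pieces and joining once also avoids
# building a new string on every step.
joined = "".join(pieces)

print(t, s, result, joined == result)
```

Either approach avoids creating a fresh intermediate string for every
concatenation.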

3. UTF-8 and UTF-16, being variable-width encodings, mean that indexing
and slicing a string would be very slow, and that's unacceptable for the
use cases of Python strings.  I'm assuming you understand big O
notation, as you talk of experience in many languages over the years.
FSR and UTF-32 are both O(1) for lookups and for finding the start of a
slice.  UTF-8, UTF-16 and any other variable-width encoding need an O(n)
scan from the start of the string to find the nth character.  A lot
slower!
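
Here is roughly what a lookup by character index has to do when all you
have is UTF-8 bytes.  This is a sketch of the general idea, not how any
particular implementation does it; utf8_char_at is a hypothetical helper
I wrote for this message.

```python
def utf8_char_at(data, index):
    """Return the character at position index in UTF-8 bytes.

    The only way to find it is to walk the bytes from the start and
    count lead bytes, so the cost grows with index: O(n).
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:       # not a continuation byte
            count += 1
            if count == index:
                start = pos             # wanted character begins here
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    return data[start:].decode("utf-8")  # wanted character was the last one

text = "a\u00e9\u20ac\U0001F600z"
data = text.encode("utf-8")

# The str lookup is a direct offset calculation; the bytes lookup scans.
matches = [utf8_char_at(data, i) == text[i] for i in range(len(text))]
print(all(matches))
```

With a fixed width per character, text[i] is just an offset calculation,
which is the whole point of FSR and UTF-32 here.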

4. Unicode is, well, Unicode.  You seem to hop all over the place from
talking about code points to bytes to bits, using them all
interchangeably.  And now you seem to be claiming that one particular
byte encoding (UTF-8) is by definition Unicode.  Or at least that's how
it sounds.  You also claim that FSR is not compliant with the Unicode
standard, which appears to me to be completely false.
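
To keep code points and bytes apart, here is a short illustration with a
single character.  The code point is one number; the bytes depend
entirely on which encoding you pick, and every one of these encodings
decodes back to the same character.  The variable names are just for
this example.

```python
euro = "\u20ac"
code_point = ord(euro)      # one code point: U+20AC

# Same code point, three different byte sequences.
encodings = {
    name: euro.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
for name, raw in encodings.items():
    print(name, len(raw), raw.hex())

# Each byte sequence decodes back to the identical one-character string.
decoded = {raw.decode(name) for name, raw in encodings.items()}
print(hex(code_point), decoded == {euro})
```

None of the three byte forms is "more Unicode" than the others.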

Is my understanding of these things wrong?


