[Python-Dev] Internal representation of strings and Micropython

Thu Jun 5 20:48:45 CEST 2014

On 6/5/2014 3:10 AM, Paul Sokolovsky wrote:
> Hello,
>
> On Wed, 04 Jun 2014 22:15:30 -0400
> Terry Reedy <tjreedy at udel.edu> wrote:
>
>> think you are again batting at a strawman. If you mean 'read from a
>> file', and all you want to do is read bytes from and write bytes to
>> external 'files', then there is obviously no need to transcode and
>> neither Python 2 nor 3 makes you do so.
> But most files, network protocols are text-based, and I (and many other
> people) don't want to artificially use "binary data" type for them,
> with all attached funny things, like "b" prefix. And then Python2
> indeed doesn't transcode anything, and Python3 does, without being
> asked, and for no good purpose, because in most cases, Input data will
> be Output as-is (maybe in byte-boundary-split chunks).
>
> So, it all goes round in circles - ignoring the forced-Unicode problem
> on python-dev's part (after a week of subscription to python-list,
> half of the traffic there appears to be dedicated to Unicode-related
> flames) is not going to help the Python community.

If all your program is doing is reading and writing data (input data 
will be output as-is), then use of binary doesn't require "b" prefix, 
because you aren't manipulating the data. Then you have no unnecessary 
transcoding.
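
As a minimal sketch of that pass-through case (the copy_stream name and
the in-memory BytesIO streams are illustrative stand-ins for real files
or sockets opened in binary mode), note that the copying code itself
contains no "b" prefix and never decodes anything:

```python
import io


def copy_stream(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst unchanged, one chunk at a time."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of input
            break
        dst.write(chunk)


# Stand-ins for open(path, "rb") / open(path, "wb") or a socket file.
original = "héllo wörld\n".encode("utf-8")
source = io.BytesIO(original)
sink = io.BytesIO()

# A deliberately tiny chunk size splits multi-byte characters across
# reads, which is harmless because nothing is ever decoded.
copy_stream(source, sink, chunk_size=4)
```

The output is byte-for-byte identical to the input no matter where the
chunk boundaries fall.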

If you actually wish to examine or manipulate the content as it flows 
by, then there are choices.

1) If you need to examine/manipulate only a small fraction of text data 
within the file, you can pay the small price of a few "b" prefixes to 
get high performance, and explicitly transcode only the portions that 
need to be manipulated.

2) If you are examining the bulk of the data as it flows by, but not 
manipulating it, just examining/extracting, then a full transcoding may 
be useful for that purpose... but you can perhaps do it explicitly, so 
that you keep the binary form for I/O. Be careful of block boundaries 
in this case, however: a multi-byte character may be split across two 
reads.

3) If you are actually manipulating the bulk of the data, then the 
double transcoding (once on input, and once on output) allows you to 
work in units of codepoints, rather than bytes, which generally makes 
the manipulation algorithms easier.

4) If you truly cannot afford the processor cost of the double 
transcoding, and need to do all your manipulations at the byte level, 
then you could avoid the need for "b" prefix by use of a preprocessor 
for those sections of code that are doing all and only bytes 
processing... and you'll have lots of arcane, error-prone code to write 
to manipulate the bytes rather than the codepoints.
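
Choices 1 and 2 above can be sketched roughly as follows. The function 
names, the header layout and the list used as a sink are assumptions 
made for illustration only; the one standard-library call relied on is 
codecs.getincrementaldecoder, which buffers an incomplete multi-byte 
sequence until the next chunk arrives.

```python
import codecs


# Choice 1: stay in bytes, decode only the one field that is needed.
def content_type(raw_headers):
    """Return the Content-Type value as str, or None if absent."""
    for line in raw_headers.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-type":
            return value.strip().decode("ascii")
    return None


# Choice 2: examine decoded text while forwarding the raw bytes as-is.
def decode_while_forwarding(chunks, sink, encoding="utf-8"):
    """Append each raw chunk to sink; return the decoded text."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for chunk in chunks:
        sink.append(chunk)  # the I/O path keeps the binary form
        # The decoder holds back a partial character at a block boundary.
        pieces.append(decoder.decode(chunk))
    # final=True raises an error if the input ended mid-character.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


headers = b"Host: example.org\r\nContent-Type: text/plain; charset=utf-8\r\n\r\n"
kind = content_type(headers)

data = "naïve café".encode("utf-8")
# 3-byte blocks split the two-byte "ï" across the first two chunks.
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
forwarded = []
text = decode_while_forwarding(chunks, forwarded)
```

Calling chunk.decode() on each block separately would fail on the first 
chunk here, which is the block-boundary hazard mentioned in choice 2.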

On the other hand, if you can convince your data sources and sinks to 
deal in UTF-8, and implement a UTF-8 str in μPy, then you can both avoid 
transcoding, and make the arcane algorithms part of the implementation 
of μPy rather than of the application code, and support full Unicode. 
And it seems to me that the world is moving that way... towards UTF-8 as 
the standard interchange format. Encourage it.
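
To give a feel for the kind of algorithm that would move into the 
implementation, here is a rough sketch of code point operations done 
directly on UTF-8 bytes. The helper names are hypothetical and this is 
not taken from μPy or CPython; it assumes the buffer is already valid 
UTF-8 and relies only on the fact that continuation bytes have the bit 
pattern 10xxxxxx.

```python
def utf8_len(buf):
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    return sum(1 for byte in buf if (byte & 0xC0) != 0x80)


def utf8_index(buf, n):
    """Return the n-th code point (0-based) of valid UTF-8 as a str."""
    # Offsets of every byte that starts a code point.
    starts = [i for i, byte in enumerate(buf) if (byte & 0xC0) != 0x80]
    if not 0 <= n < len(starts):
        raise IndexError("code point index out of range")
    end = starts[n + 1] if n + 1 < len(starts) else len(buf)
    return buf[starts[n]:end].decode("utf-8")


data = "μPy ✓".encode("utf-8")  # 8 bytes, 5 code points
count = utf8_len(data)
first = utf8_index(data, 0)
last = utf8_index(data, 4)
```

Indexing is linear in the length of the buffer in this sketch; the 
point is only that the application never sees the byte-level work, and 
the same bytes can go straight to a UTF-8 sink without transcoding.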

Glenn