[Python-Dev] Allowing u.encode() to return non-strings

Wed Jun 30 01:41:08 EDT 2004

Tim,

I'm not sure this needs to be on the list.  My major point, I guess,
is that the byte vectors we tend to call strings in Python have no
string-ness, as understood in the 21st century.  There is no character
set associated with them, which means that there is effectively no way
to look at the "next character" in a string (you don't know how long a
character is), no way to count the number of characters, etc.  The
documentation, particularly the language manual, is extremely
confusing on this point: it classifies "string" and "Unicode" objects
as the same sort of thing, and then fails to document either clearly.
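
To make the point concrete, here is a small sketch.  It assumes a
Python (3.x) in which byte vectors and text are separate types; the
sample text and variable names are only illustrative.

```python
# Assumes Python 3, where bytes and text are distinct types.
text = "na\u00efve caf\u00e9"            # 10 characters, two of them non-ASCII
raw = text.encode("utf-8")              # the "byte vector" form of the same text

byte_count = len(raw)                   # len() on a byte vector counts bytes
char_count = len(raw.decode("utf-8"))   # characters can be counted only once the encoding is known

# The same characters in a different encoding occupy a different number of bytes.
latin1_count = len(text.encode("latin-1"))

print(byte_count, char_count, latin1_count)
```

Without knowing the encoding, nothing in "raw" itself says where one
character ends and the next begins.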

"struct.pack", for instance, doesn't really return a string -- it
returns a byte vector.
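
A short illustration, again assuming Python 3.x, where the return
type makes the distinction visible (the format string is just an
example):

```python
import struct

# Pack a 16-bit and a 32-bit unsigned integer, little-endian, without padding.
packed = struct.pack("<HI", 1, 2)

# The result is raw binary data; no character set is attached to it.
is_bytes = isinstance(packed, bytes)
size = len(packed)

print(packed, is_bytes, size)
```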

Unicode is really the only kind of *string* type that's supported,
which is problematic, as it's not integrated with the file streams
support.  For instance, how do I write a function that opens a file
containing text in some multi-byte format (which, we'll assume, I know
the name of -- perhaps from a content-type field), and reads the first
three characters of the text?  Can't.  That's because the "file"
constructor doesn't take an encoding, and "read" and "readline" don't
return Unicode objects.  I could try, by reading some bytes, then
using unicode() to turn them into a string, then seeing how many
characters I read, but that's pretty imprecise.  I go round and round
the "codecs" module thinking that someone must have thought of this --
or maybe there's an optional argument to file() that makes it return
real (Unicode) strings -- but no luck.
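
Here is a sketch of the function I have in mind.  It assumes that a
stream reader obtained from codecs.getreader() can be asked for a
number of characters via a "chars" argument to read(), as it can in
Python 3.x; "first_chars" is a hypothetical helper name, and the
BytesIO object merely stands in for a file opened in binary mode.

```python
import codecs
import io


def first_chars(stream, encoding, count=3):
    """Return the first `count` characters decoded from a byte stream.

    `first_chars` is a hypothetical helper, not an existing API.
    """
    reader = codecs.getreader(encoding)(stream)
    # `size` is a hint in bytes per underlying read; `chars` is the number
    # of decoded characters wanted.  Incomplete multi-byte sequences are
    # held back until more bytes arrive.
    return reader.read(size=count, chars=count)


# Stand-in for a file opened in binary mode.
sample = io.BytesIO("日本語のテキスト".encode("utf-8"))
result = first_chars(sample, "utf-8")
print(result)
```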

I find it hard to believe that I've dreamed up something that neither
you nor (especially) Martin have thought of till now.  But consider
this idea.

Any file that is not explicitly opened as binary (with the 'b' flag)
should be considered a text file.  (By the way, why isn't the 'b' flag
the default for file opening?  It would save a lot of grief dealing
with Windows.)  A text file should have an associated "encoding"
attribute (as file objects already do), which would also be a keyword
parameter to the constructor.  The default would be
sys.getdefaultencoding().  The "size" parameter to the methods "read"
and "readline" should refer to characters, not bytes, for text files.
The return values from "next", "read" and "readline" would be Unicode
objects for text files.  Similarly, the methods "write" and
"writelines" should, for text files, take Unicode objects and raise an
exception if fed a "byte vector".
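
A sketch of the behaviour described above.  It uses the built-in
open() of Python 3.x as a stand-in for the proposed constructor, on
the assumption that its text mode behaves this way; the file name
and contents are made up for the example.

```python
import os
import tempfile

# Assumes Python 3's built-in open() as a stand-in for the proposed constructor.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("日本語のテキスト\n")      # text files accept (Unicode) strings
    try:
        f.write(b"raw bytes")        # a byte vector is rejected
        rejected = False
    except TypeError:
        rejected = True

with open(path, "r", encoding="utf-8") as f:
    file_encoding = f.encoding       # the "encoding" attribute
    head = f.read(3)                 # size counts characters, not bytes

with open(path, "rb") as f:
    raw_head = f.read(3)             # binary mode: size counts bytes

print(file_encoding, head, raw_head, rejected)
```

Read as text, the first three units are three characters; read as
binary, the same request yields only the three bytes of the first
character.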

I'd go further.  I'd introduce the notation

    v = b"abc"

which assigns to "v" an 8-bit byte vector (what we call a "string" today).
Then, after a release or two, I'd make plain old

    "foo"

mean what

    u"foo"

means today, so that string literals are by default Unicode (modulo
PEP 263).
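
A sketch of how the two kinds of literal would then behave, assuming
the semantics of Python 3.3 or later, where the b"" prefix exists and
the u"" prefix is still accepted:

```python
v = b"abc"     # explicit byte-vector literal
s = "foo"      # a plain literal is text (Unicode)
u = u"foo"     # the u prefix is accepted (Python 3.3+) and means the same thing

first_byte = v[0]                              # indexing a byte vector yields an integer
same_text = (s == u) and type(s) is type(u)    # no difference between the two spellings
as_text = v.decode("ascii")                    # becomes text only via an explicit decode

print(first_byte, same_text, as_text)
```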

Bill