[I18n-sig] Pre-PEP: Proposed Python Character Model

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 8 Feb 2001 20:58:20 +0100


> print u"hello world"
> 
> rather than the easier
> 
> print "hello world"
> 
> even though the message is clearly text.

You can easily make the latter a Unicode string by invoking Python
with the -U option. If the pragma PEP is ever implemented, one pragma
should be reserved to declare the source file encoding, and another
one to declare all strings as Unicode in this file.

> I think we agree that, eventually, we would like the simple notation
> for a string literal to create a unicode string. What I'm not sure
> about is whether we can make that change soon. How often are string
> literals used to create what is logically just binary data?

Let's have a look. Excluding __doc__ strings (which can be recognized
syntactically), performing grep '"' in the Python library, I get

BaseHTTPServer.py:__version__ = "0.2"
BaseHTTPServer.py:__all__ = ["HTTPServer", "BaseHTTPRequestHandler"] 

Both are "protocol" in some sense, i.e. not meant to be
human-readable. +2 for binary data

BaseHTTPServer.py:DEFAULT_ERROR_MESSAGE = """\ 

This is text, which brings the running tally down to +1 for binary
data. Actually, it is HTML, so it must be encoded in some encoding
when it is transferred; it *could* therefore be considered the encoded
message instead.

BaseHTTPServer.py:    sys_version = "Python/" + string.split(sys.version)[0]
BaseHTTPServer.py:    server_version = "BaseHTTP/" + __version__ 
BaseHTTPServer.py:        self.request_version = version = "HTTP/0.9" # Default
BaseHTTPServer.py:                self.send_error(400, "Bad request version (%s)
BaseHTTPServer.py:                                "Bad HTTP/0.9 request type (%s
BaseHTTPServer.py:            self.send_error(400, "Bad request syntax (%s)" % `
BaseHTTPServer.py:            self.send_error(501, "Unsupported method (%s)" % `

Part of the HTTP protocol, thus binary data. +9

BaseHTTPServer.py:        self.log_error("code %d, message %s", code, message) 

Log file; this is text, so +8

BaseHTTPServer.py:            self.wfile.write("%s %s %s\r\n" %

HTTP protocol, +9

There are a few more. In total, BaseHTTPServer.py contains more binary
strings than text strings.
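
The survey above was done by hand with grep. As a minimal sketch of
automating the first step (listing string literals while skipping
docstrings), assuming a present-day Python 3.8+ with the ast module,
something like the following would do. The function name and the
sample source are hypothetical, and deciding whether each literal is
"binary" or "text" still has to be done by a human reader.

```python
import ast

def non_doc_string_literals(source):
    """Return sorted (lineno, value) pairs for string literals that are not docstrings."""
    tree = ast.parse(source)

    # A docstring is a bare string expression at the start of a module,
    # class, or function body; remember those nodes so we can skip them.
    doc_nodes = set()
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, scopes) and node.body:
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                doc_nodes.add(id(first.value))

    # Every remaining str constant is a candidate for manual classification.
    found = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, str)
                and id(node) not in doc_nodes):
            found.append((node.lineno, node.value))
    return sorted(found)

# Hypothetical sample input, loosely modelled on the grep output above.
sample = '''"""Module docstring."""
__version__ = "0.2"

def handler():
    """Function docstring."""
    return "HTTP/0.9"
'''

for lineno, value in non_doc_string_literals(sample):
    print(lineno, repr(value))
```

Unlike grep '"', this also finds single-quoted literals and ignores
quote characters that appear inside comments.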

For other files, the ratio may vary. In general, I believe "binary"
strings are more common in source code, as many of the strings are
processed by some other program which expects a specific byte
sequence, rather than a character string.
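
To illustrate the difference between a byte sequence and a character
string, here is a small sketch in present-day Python 3, where the two
are separate types (an assumption; the Python discussed in this thread
has no such split). The variable names and the sample strings are
made up for the example.

```python
# A protocol string: the peer expects these exact bytes on the wire.
status_line = "%s %d %s\r\n" % ("HTTP/1.0", 200, "OK")
wire_bytes = status_line.encode("ascii")

# A human-readable string: the same characters give different byte
# sequences depending on the encoding chosen for transfer.
body = "<h1>Gr\u00fc\u00dfe</h1>"
as_utf8 = body.encode("utf-8")
as_latin1 = body.encode("latin-1")

print(wire_bytes)
print(len(body), len(as_utf8), len(as_latin1))
```

The protocol string has only one acceptable byte representation, while
the text has one per encoding, so the encoding must be chosen when the
text is sent.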

Human-readable strings are probably more common in GUI
applications. One should think about i18n here, which means that the
actual localized message catalogs must be separate from the program
logic.
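
As a rough sketch of keeping the catalog separate from the logic, the
standard gettext module can be used as below. The domain "myapp" and
the directory "locale" are hypothetical placeholders, and the sketch
assumes a present-day Python 3.

```python
import gettext

# "myapp" and "locale" are hypothetical names for this sketch. With
# fallback=True, a missing catalog yields a NullTranslations object,
# whose gettext() returns each message unchanged.
translation = gettext.translation("myapp", localedir="locale", fallback=True)
_ = translation.gettext

def greeting():
    # The program logic only names the message; the localized text
    # lives in the catalog, outside the source file.
    return _("hello world")

print(greeting())
```

The source file then carries only the message keys, and translators
work on the catalog files without touching the code.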

Regards,
Martin