Of console I/O, characters, strings & dogs

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Mon Dec 29 20:40:36 EST 2008


On Mon, 29 Dec 2008 18:53:45 -0600, David Lemper wrote:

> I am trying getch() from msvcrt.  The following module has been run with
> 3 different concatination statements and none yield a satisfactory
> result.    Python 3.0


Your first problem is that you've run into the future of computing: 
strings are not bytes. For many decades, programmers have been pretending 
that they are, and you can get away with it so long as you ignore the 95% 
of the world that doesn't use English as their only language.

Unfortunately, the time for this has passed. Fortunately, it's (mostly) 
not difficult to use Unicode, and Python makes it easy for you.

> # script12
> import msvcrt
> shortstr1 = 'd' + 'o' + 'g'
> print(shortstr1)

In Python 3, shortstr1 is a Unicode string. Python 3 uses Unicode as its 
string type. Because Python is doing all the hard work behind the scenes, 
you don't have to worry about it, and you can just print shortstr1 and 
everything will Just Work.


> char1 = msvcrt.getch()
> char2 = msvcrt.getch()
> char3 = msvcrt.getch()

I don't have msvcrt here but my guess is that getch is returning a *byte* 
rather than a *character*. In the Bad Old Days, all characters were 
bytes, and programmers pretended that they were identical. (This means 
you could only have 256 of them, and not everyone agreed what those 256 
of them were.)

But in Unicode, characters are characters, and there are thousands of 
them. MANY thousands. *Way* too many to store in a single byte.
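
To make the size point concrete, here is a small sketch showing how many 
bytes a few sample characters need once encoded. The choice of UTF-8 and 
of these particular characters is an assumption made for illustration 
only.

```python
# How many bytes does each character need when encoded as UTF-8?
samples = ['d', '£', 'β', '伎']

# Map each character to the length of its UTF-8 encoding.
sizes = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, size in sizes.items():
    # Print the code point rather than the character itself, so the
    # output is safe on consoles that cannot display non-ASCII text.
    print('code point', ord(ch), 'needs', size, 'byte(s)')
```

Only the plain English letter fits in one byte; the others need two or 
three.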


>          <  alternatives for line 8 below  >
> print(shortstr2)
> 
>                print(shortstr1) gives    dog    of course. If the same
>                char are entered individually at the console,  as char1,
>                2 & 3, using msvcrt.getch(), I have not been able to get
>                out a plain dog.
> 
>        If line 8 is   shortstr2 = char1[0] + char2[0] + char3[0]
>             print(shortstr2)  yields     314


>>> ord('d') + ord('o') + ord('g')
314

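In Python 3, indexing a bytes object gives you an integer, the numeric 
value of that byte. So char1[0] + char2[0] + char3[0] is plain integer 
addition, 100 + 111 + 103, not concatenation.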
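
A minimal sketch of that arithmetic, using byte literals as stand-ins for 
what getch() presumably returned (one-byte values such as b'd' are an 
assumption, since the actual return values were not shown):

```python
# Stand-ins for the values msvcrt.getch() is assumed to have returned.
char1, char2, char3 = b'd', b'o', b'g'

first = char1[0]                         # indexing bytes gives an int
total = char1[0] + char2[0] + char3[0]   # integer addition
joined = char1 + char2 + char3           # bytes concatenation

print(type(first).__name__, first)
print(total)
print(joined)
```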



>        If line 8 is   shortstr2 = 'char1[0]' + 'char2[0]' + 'char3[0]'
>             print(shortstr2)  yields     char1[0]char2[0]char3[0]


Of course it does. You're setting shortstr2 equal to the literal strings 
'char1[0]' etc. But nice try: you want to convert each not-really-a-char 
to a (Unicode) string. You don't do that with the '' delimiters, as that 
makes a literal string, but with the str() function.

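Be careful, though: calling str() on a bytes object with no encoding 
does not decode it. In Python 3 it just returns the repr, so str(char1) 
gives you "b'd'". You have to tell str() which encoding to use. Either of 
these should work:

shortstr2 = str(char1, 'ascii') + str(char2, 'ascii') + str(char3, 'ascii')
shortstr2 = str(char1 + char2 + char3, 'ascii')

While they will work for plain English letters, they will raise a 
UnicodeDecodeError as soon as you try using characters like £ © ë β 伎 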
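
For reference, a small sketch of how str() behaves on bytes objects in 
Python 3, with and without an encoding argument. The byte literals are 
stand-ins for the getch() results, and 'ascii' is an assumed encoding.

```python
# Stand-ins for the values msvcrt.getch() is assumed to have returned.
char1, char2, char3 = b'd', b'o', b'g'

# Without an encoding, str() returns the repr of the bytes object.
as_repr = str(char1)

# With an encoding, str() decodes the bytes into real text.
decoded = str(char1 + char2 + char3, 'ascii')

print(as_repr)
print(decoded)
```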

The right way to convert bytes to characters is to decode them, and to 
decode them, you have to know what encoding system is used. If the 
characters are plain-old English letters typed on an American keyboard, 
you can do this:

data = char1 + char2 + char3
shortstr2 = data.decode('ascii')

but this will raise a UnicodeDecodeError if data contains non-ASCII 
values. (I've called the variable data rather than bytes so as not to 
shadow the built-in bytes type.)

Better is to go the whole hog and use the UTF-8 encoding, unless you 
specifically know to use something else:

shortstr2 = data.decode('utf-8')
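
Putting the pieces together, here is a hedged sketch of reading a few 
keystrokes and decoding them. The helper read_text and its parameters are 
hypothetical names invented for this example, and the keystrokes are 
simulated so that it runs without a Windows console.

```python
def read_text(getch, count, encoding='ascii'):
    """Read `count` keystrokes via `getch` and decode them to a str.

    `getch` is any zero-argument callable returning a bytes object,
    for example msvcrt.getch on Windows. Hypothetical helper.
    """
    raw = b''.join(getch() for _ in range(count))
    return raw.decode(encoding)

# Simulate three keystrokes instead of calling msvcrt.getch().
fake_keys = iter([b'd', b'o', b'g'])
shortstr2 = read_text(lambda: next(fake_keys), 3)
print(shortstr2)

# Decoding with the wrong codec fails loudly rather than guessing.
pound_utf8 = '£'.encode('utf-8')   # two bytes in UTF-8
try:
    pound_utf8.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok, pound_utf8.decode('utf-8') == '£')
```

On Windows you would pass msvcrt.getch itself as the first argument. The 
msvcrt module also provides getwch(), which returns a Unicode character 
directly. Which encoding the console actually delivers depends on its 
configuration, so 'ascii' here is only an assumption.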


>        If line 8 is   shortstr2 = char1 + char2 + char3
>             print(shortstr2)  yields    b 'dog'
>  
>                     Is the latter out of "How to Speak Redneck"  ?
>
>               Possibly b means bit string.

Nice guess, close but not quite. It actually means byte string.
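
A quick sketch of the difference between the two literal forms. The 
variable names are mine, chosen for illustration.

```python
data = b'dog'   # bytes literal: a sequence of small integers
text = 'dog'    # str literal: a sequence of Unicode characters

print(type(data).__name__, type(text).__name__)
print(data == text)                   # bytes and str never compare equal
print(data.decode('ascii') == text)   # equal once the bytes are decoded
```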

You probably should read this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must 
Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky

http://www.joelonsoftware.com/articles/Unicode.html


Good luck!



-- 
Steven


