bad data from urllib when run from MS .bat file

Stuart McGraw smcg4191 at frii.RimoovThisToReply.com
Sat Sep 18 18:23:40 EDT 2004


I just spent a $*#@!*&^&% hour registering at ^$#@#%^ 
SourceForge and trying to submit a Python bug report
but it still won't let me.  I give up.  Maybe someone who 
cares will see this post, or maybe it will save time for 
someone else who runs into this problem...

================================================

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page from a Python script run from an MS Windows .BAT
file, but not when the same script is run from the command line.

Note: To reproduce this problem, it helps to have East Asian font 
support installed on the test system.  In Windows 2000:
  Control Panel, 
    Regional Options, General tab
      check mark in Japanese in the "Language settings..." area.
Python also needs either the cjkcodecs (http://cjkpython.berlios.de/) 
or Tamito KAJIYAMA's japanese codecs 
(http://www.asahi-net.or.jp/~rd6t-kjym/python/)
installed. 

To reproduce the problem...

1. Create a Python file, test.py:
test.py:
----------------
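# Python 2.3 script; needs the cjkcodecs package for EUC-JP support.
import sys, urllib, cjkcodecs
f = urllib.urlopen (sys.argv[1])          # fetch the URL given as the first argument
for ln in f:
    ln = ln.decode ("cjkcodecs.euc-jp")   # EUC-JP bytes -> unicode
    print ln.encode("utf-8"),             # write as UTF-8; trailing comma avoids an extra newline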
----------------

2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
  python test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1 >out1.txt
  test.bat >out2.txt

4. out1.txt and out2.txt should be identical.  But they are not.
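
To locate where the two files diverge, something like the following could be used. It is only a sketch, written for a current Python 3 interpreter rather than the 2.3.4 in this report, and the helper names first_difference and read_lines are made up for illustration.

```python
def read_lines(path):
    """Read a file as raw bytes and split it into lines."""
    with open(path, "rb") as f:
        return f.read().splitlines()

def first_difference(lines_a, lines_b):
    """Return (index, line_a, line_b) for the first mismatch, or None.

    A missing line (one file shorter than the other) is reported as None.
    """
    for i, (a, b) in enumerate(zip(lines_a, lines_b)):
        if a != b:
            return i, a, b
    if len(lines_a) != len(lines_b):
        i = min(len(lines_a), len(lines_b))
        a = lines_a[i] if i < len(lines_a) else None
        b = lines_b[i] if i < len(lines_b) else None
        return i, a, b
    return None
```

For the files above the call would be first_difference(read_lines("out1.txt"), read_lines("out2.txt")). Comparing bytes rather than decoded text keeps the check independent of any codec.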

The URL used will return an EUC-JP encoded page with some Japanese 
characters in it.  test.py reads the page line by line, decodes 
the lines to Unicode, re-encodes them to UTF-8, and writes to a file.  
Thus the output file should be a UTF-8 version of the EUC-JP web page.  
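
As an illustration of that decode/re-encode step on its own, here is a small sketch for a current Python 3 interpreter, where the "euc_jp" codec ships with the standard library (an assumption that does not hold for the 2.3.4 setup above, which needs cjkcodecs). The six bytes are the EUC-JP bytes for "taberu" that appear in the URL.

```python
# The six EUC-JP bytes for "taberu", taken from the URL in this report.
euc_bytes = b"\xbf\xa9\xa4\xd9\xa4\xeb"

text = euc_bytes.decode("euc_jp")     # EUC-JP bytes -> Unicode string
utf8_bytes = text.encode("utf-8")     # Unicode string -> UTF-8 bytes

print(len(text))                      # number of characters decoded
print(utf8_bytes.hex())               # the UTF-8 bytes as hex digits
```

The six input bytes decode to three characters, and each of them takes three bytes in UTF-8.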

The first command runs test.py directly.  The second command runs 
the identical command from a Windows batch file.  One should expect 
out1.txt and out2.txt to be identical.

out1.txt (created by running test.py from the command line) is 
correct (verify by opening out1.txt in Notepad and selecting a 
Japanese-capable font, e.g. Lucida Sans Unicode).  The string in 
the first cell of the HTML table is the three Japanese characters 
for the word "taberu".

But in out2.txt (created by running test.py from a Windows .bat 
file), instead of Japanese characters there, we see the ASCII text 
string "A9D9EB".  (The EUC-JP value of the actual Japanese characters 
that should be there is \xBF\xA9\xA4\xD9\xA4\xEB, so the printed 
hex digits seem to come from alternate bytes of the EUC-JP string.)
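
The "alternate bytes" pattern is at least consistent with how cmd.exe treats percent signs inside a batch file: %NAME% is replaced by the value of the variable NAME (empty if undefined), and a literal percent sign has to be written as %%. This has not been verified on the system in this report. The sketch below is a rough Python 3 model of that rule, not cmd.exe's real parser, and the function name expand_batch_percents is hypothetical.

```python
def expand_batch_percents(line, env=None, args=()):
    """Rough model of percent handling for one line of a .bat file.

    Simplifications (assumptions): only %%, %0..%9 and %NAME% are
    modelled, and a lone unmatched % is dropped.
    """
    env = env or {}
    out = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        nxt = line[i + 1:i + 2]
        if nxt == "%":                 # %% is a literal percent sign
            out.append("%")
            i += 2
        elif nxt.isdigit():            # %0..%9 are batch arguments
            n = int(nxt)
            out.append(args[n] if n < len(args) else "")
            i += 2
        else:
            end = line.find("%", i + 1)
            if end == -1:              # lone %: dropped in this model
                i += 1
            else:                      # %NAME%: value, or empty if undefined
                out.append(env.get(line[i + 1:end].upper(), ""))
                i = end + 1
    return "".join(out)

query = "wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1"
print(expand_batch_percents(query))
print(expand_batch_percents(query.replace("%", "%%")))
```

Under this model %BF%, %A4% and %A4% are read as undefined variables and vanish, leaving A9, D9 and EB. If that is what is happening, writing each % as %% in test.bat would be the thing to try.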

In other lines with Japanese characters a similar effect is seen: 
the first two Japanese characters are replaced with a string of 
hex digits.  Strangely, the remaining Japanese characters on the line
are not corrupted.

Running with a debugger shows that the corruption is in the text 
received from urllib; it is not a result of the EUC-JP decoding,
UTF-8 encoding, or writing to the output file.
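
Since the corruption shows up before decoding, it may also be worth confirming that the URL reaching the script is the same in both runs. A minimal Python 3 sketch follows; the helper name percent_escapes is made up, and the output goes to stderr so it does not end up in out1.txt or out2.txt.

```python
import re
import sys

def percent_escapes(url):
    """Return the %XX escapes found in url, in order (hypothetical helper)."""
    return re.findall(r"%[0-9A-Fa-f]{2}", url)

if len(sys.argv) > 1:
    # Report what the script actually received, before any network access.
    sys.stderr.write("argv[1] = %r\n" % (sys.argv[1],))
    sys.stderr.write("escapes = %r\n" % (percent_escapes(sys.argv[1]),))
```

If the two runs print different values for argv[1], the difference arises before urllib is involved.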

So it looks like some bad mojo between urllib and the Windows
batch environment.




