bad data from urllib when run from MS .bat file

Mon Sep 20 11:35:35 EDT 2004

On Mon, 20 Sep 2004 07:33:13 -0600, "Stuart McGraw" <smcg4191 at frii.RimoovThisToReply.com> wrote:

>"Bengt Richter" <bokr at oz.net> wrote in message news:cilp2b$l3k$0$216.39.172.122 at theriver.com...
>> On Sat, 18 Sep 2004 16:23:40 -0600, "Stuart McGraw" <smcg4191 at frii.RimoovThisToReply.com> wrote:
>> 
>> >I just spent a $*#@!*&^&% hour registering at ^$#@#%^ 
>> >SourceForge and trying to submit a Python bug report
>> >but it still won't let me.  I give up.  Maybe someone who 
>> >cares will see this post, or maybe it will save time for 
>> >someone else who runs into this problem...
>> >
>> >================================================
>> >
>> >Environment:
>> >- Microsoft Windows 2000 Pro
>> >- Python 2.3.4
>> >- urllib (version shipped with Python-2.3.4)
>> >
>> >Problem: 
>> > urllib returns corrupted data when reading an EUC-JP encoded 
>> > web page, from a python script run from a MS Windows .BAT 
>> >file, but not when the same script is run from the command line.
>> Just a thought: in case your command line is being interpreted
>> by cmd.exe and .bat by something else (command.com?) you could
>> check if it makes a difference, e.g.,
>> 
>> copy test.bat test.cmd
>> 
>> and try it again? (explicitly as test.cmd, not just test, since any
>> same-name .com or .exe or .bat may have priority over .cmd)
>> You can probably investigate the latter by something like
>> 
>>  [21:54] C:\pywk\junk>echo %pathext%
>>  .COM;.EXE;.BAT;.CMD
>
>Well, I'm pretty sure cmd.exe was executing it, but I tried your
>suggestion to make absolutely sure.  Same results :-(
>Given the other (seeming) urllib problem I mentioned in another
>post in this thread, which appeared without any involvement
>of batch scripts, I am getting more and more suspicious that
>urllib is buggy, at least with non-single byte data.
>
Hm, what happens if you make a test2.py and pass it the name of an output
file instead of piping the output from print? In fact, eliminate the
encoding, the line generator, and everything else, and just have test2.py read
the entire server response in one single read and write it out in binary, i.e.,
     open(sys.argv[2],'wb').write(urllib.urlopen(...).read())

That should show whether Python is seeing identical input from the server
in both cases. Then you could do it line-wise (not with a print line ending in ",",
but with a binary file write). That would say whether line-by-line chunking of the
input is doing anything to the data, e.g. if urllib is buffering/chunking
differently for interactive vs. .bat runs. Just grasping at straws, but eliminating
chunking, piping, re-encoding, and binary vs. text mode doubts from the test should
show why interactive and .bat runs differ, I would think.
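
To make the two tests concrete, here is a rough sketch. It is written for a
current Python 3, where the call is urllib.request.urlopen (on 2.3.4 it would
be urllib.urlopen), and the names copy_whole and copy_linewise are made up
for illustration, not taken from your test.py.

```python
import urllib.request  # only needed for the commented usage at the bottom


def copy_whole(src, out_path):
    """Read everything from src in one read() and write it in binary."""
    data = src.read()
    with open(out_path, "wb") as out:
        out.write(data)
    return len(data)


def copy_linewise(src, out_path):
    """Same copy, but iterating over src line by line (still binary)."""
    total = 0
    with open(out_path, "wb") as out:
        for line in src:
            out.write(line)
            total += len(line)
    return total


# Hypothetical usage (needs network access; url is whatever page you test):
#   with urllib.request.urlopen(url) as response:
#       print(copy_whole(response, "whole.bin"), "bytes written")
```

Run each variant once interactively and once from the .bat file; if all the
output files are byte-for-byte identical, the input side is presumably fine
and the problem is somewhere in the output path.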

Also, your mention of two-character errors made me wonder about spurious BOMs
or the like, from encoding substrings of the file as though they were entire files.
Would a final print for a final '\n' do anything that might trigger a final flush
differently, with potential cooking consequences? (Why the print with a space, BTW?)
What if you just do your own file.write output in binary and control everything?
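
If you end up with two binary dumps (one from the interactive run, one from
the .bat run), something like this would show where they first diverge and
whether any BOM-like byte sequences turn up. It is only a sketch, again for
a current Python 3, and the helper names are mine. Note that the UTF-8 mark
EF BB BF can probably occur legitimately inside EUC-JP text, so a hit is a
hint, not proof.

```python
import codecs

# Byte order marks to look for anywhere in the data.
BOMS = [
    ("utf-8", codecs.BOM_UTF8),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
]


def bom_offsets(data):
    """Return sorted (offset, name) pairs for every BOM-like sequence."""
    hits = []
    for name, mark in BOMS:
        start = data.find(mark)
        while start != -1:
            hits.append((start, name))
            start = data.find(mark, start + 1)
    return sorted(hits)


def first_difference(a, b):
    """Return the offset of the first differing byte, or None if equal."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One input is a prefix of the other; they differ where it ends.
        return min(len(a), len(b))
    return None
```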

Just some additional thoughts. Sorry the cmd vs. bat thing didn't turn anything up.
BTW, what command line options do you use to start your interactive session
(it is the console, not IDLE, right?). You didn't seem to have any (e.g. -u) in test.py.
Could the .BAT file be seeing a different environment? Could the http://.. need quoting?
I.e., could the server be seeing a glitched URL tail and be sending the same file but
with some different option?
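
One way to check the environment and quoting questions is to have the script
record exactly what it was handed, once from the prompt and once from the
.bat file, and then diff the two reports. Keep in mind that inside a batch
file cmd.exe treats % specially (%1 is a batch parameter, for instance), so a
URL containing percent escapes may not arrive as typed. A sketch, for a
current Python 3, with made-up function names:

```python
import os
import sys


def describe_run(argv=None, environ=None):
    """Return text lines describing the arguments and the environment."""
    argv = sys.argv if argv is None else argv
    environ = os.environ if environ is None else environ
    # repr() makes stray quotes, spaces, and odd characters visible.
    lines = ["argv[%d]=%r" % (i, arg) for i, arg in enumerate(argv)]
    lines += ["env %s=%r" % (key, environ[key]) for key in sorted(environ)]
    return lines


def write_report(path, argv=None, environ=None):
    """Write the description to path, one item per line."""
    with open(path, "w") as out:
        out.write("\n".join(describe_run(argv, environ)) + "\n")
```

Calling write_report with a different file name in each kind of run gives
you two text files to compare side by side.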

Hope something gives you a useful idea. That's all I can think of for the moment ;-)

Regards,
Bengt Richter