piping input to an external script

Tue May 12 02:31:45 EDT 2009

Steve Howell wrote:
> On May 11, 10:16 pm, norseman <norse... at hughes.net> wrote:
>> Tim Arnold wrote:
>>> Hi, I have some html files that I want to validate by using an external
>>> script 'validate'. The html files need a doctype header attached before
>>> validation. The files are in utf8 encoding. My code:
>>> ---------------
>>> import os,sys
>>> import codecs,subprocess
>>> HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
>>> filename  = 'mytest.html'
>>> fd = codecs.open(filename,'rb',encoding='utf8')
>>> s = HEADER + fd.read()
>>> fd.close()
>>> p = subprocess.Popen(['validate'],
>>>                     stdin=subprocess.PIPE,
>>>                     stdout=subprocess.PIPE,
>>>                     stderr=subprocess.STDOUT)
>>> validate = p.communicate(unicode(s,encoding='utf8'))
>>> print validate
>>> ---------------
>>> I get lots of lines like this:
>>> Error at line 1, character 66:\tillegal character number 0
>>> etc etc.
>>> But I can give the command in a terminal 'cat mytest.html | validate' and
>>> get reasonable output. My subprocess code must be wrong, but I could use
>>> some help to see what the problem is.
>>> python2.5.1, freebsd6
>>> thanks,
>>> --Tim
>> ============================
>> If you search through the recent Python-List for UTF-8 things you might
>> get the same understanding I have come to.
>>
>> The problem is the use of Python's 'print' statement, or whatever it
>> is called. It 'cooks' things, and someone decided that it would only
>> handle half of a byte's values (the x'00' to x'7f' range) and ignore or
>> send error messages against anything else. I guess the person doing the
>> deciding read the part that says ASCII printables are in the 7-bit range
>> and chose to ignore the part about the rest of the byte being undefined.
>> That is undefined, not disallowed, which means the high-bit half can be
>> used as wanted since it isn't already taken. Nor did whoever it was take
>> a look around the computer world and realize the conflict that was going
>> to be generated by using only half of a byte's values in a 1-byte+ world.
>>
>> If you can modify your code to use read and write, you can bypass print
>> and be OK.  Or just have Python do the 'cat mytest.html | validate' for
>> you: use a variable for the html file name and let Python accomplish the
>> equivalent of Unix's:
>>     for f in *.html; do cat "$f" | validate; done
>>                          or
>>     for f in *.html; do validate "$f"; done  # file name available this way
>>
>> If you still have problems, take a look at os.popen2 (and its sibling
>> os.popen3). Also take a look at the os.spawn* family.
>>
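
For what it's worth, the shell loops quoted above can be approximated from
Python without going through print at all. This is only a minimal sketch,
assuming Python 3 and assuming 'validate' is on PATH and reads the document
from stdin; validate_all and its parameters are made-up names for
illustration, not anything from the original script.

```python
import glob
import subprocess


def validate_all(pattern="*.html", command=("validate",)):
    """Feed each matching file to the validator on stdin.

    Roughly: for f in *.html; do cat "$f" | validate; done
    'command' defaults to the external 'validate' script (an assumption).
    """
    results = {}
    for path in sorted(glob.glob(pattern)):
        # Open in binary mode so the bytes reach the validator untouched,
        # exactly as 'cat' would deliver them.
        with open(path, "rb") as fd:
            proc = subprocess.run(
                list(command),
                stdin=fd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
            )
        # Decode only for display; undecodable bytes are replaced, not fatal.
        results[path] = proc.stdout.decode("utf-8", errors="replace")
    return results
```

Because the file is handed over as raw bytes, no decoding or re-encoding
happens on the way in, which is the closest match to the terminal command.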
> 
> Wow.  Unicode and subprocessing and printing can have dark corners,
> but common sense does apply in MOST situations.
> 
> If you send the header, add the newline.
> 
> But you do not need the header if you can cat the input file sans
> header and get sensible output.
> 

Yep!  The problem is with 'print'

> Finally, if you are concerned about adding the header, then it belongs
> in the original input file; otherwise, you are creating a false
> positive.
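
If the header really does have to be prepended in Python, one way to do it
is roughly the following. It is only a sketch under assumptions: Python 3,
'validate' reads the document from stdin, and build_payload / run_validator
are hypothetical helper names. The two points it illustrates are the newline
after the header and encoding the text to UTF-8 bytes before it goes into
the pipe, since a pipe carries bytes rather than unicode objects.

```python
import subprocess

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'


def build_payload(html_text, header=HEADER):
    """Put the doctype on its own line and return UTF-8 encoded bytes."""
    return (header + "\n" + html_text).encode("utf-8")


def run_validator(filename, command=("validate",)):
    """Pipe header + file contents to the validator and return its output.

    'command' defaults to the external 'validate' script (an assumption).
    """
    # Decode the file explicitly as UTF-8 so the text is well defined.
    with open(filename, encoding="utf-8") as fd:
        payload = build_payload(fd.read())
    proc = subprocess.Popen(
        list(command),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    # communicate() is given bytes here, never a text/unicode object.
    output, _ = proc.communicate(payload)
    return output.decode("utf-8", errors="replace")
```

Something like print(run_validator('mytest.html')) would then show the
validator's report; the printing happens only after the pipe is done.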

Steve