piping input to an external script

norseman norseman at hughes.net
Tue May 12 12:37:17 EDT 2009


Steve Howell wrote:
> On May 11, 11:31 pm, norseman <norse... at hughes.net> wrote:
>> Steve Howell wrote:
>>> On May 11, 10:16 pm, norseman <norse... at hughes.net> wrote:
>>>> Tim Arnold wrote:
>>>>> Hi, I have some html files that I want to validate by using an external
>>>>> script 'validate'. The html files need a doctype header attached before
>>>>> validation. The files are in utf8 encoding. My code:
>>>>> ---------------
>>>>> import os,sys
>>>>> import codecs,subprocess
>>>>> HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
>>>>> filename  = 'mytest.html'
>>>>> fd = codecs.open(filename,'rb',encoding='utf8')
>>>>> s = HEADER + fd.read()
>>>>> fd.close()
>>>>> p = subprocess.Popen(['validate'],
>>>>>                     stdin=subprocess.PIPE,
>>>>>                     stdout=subprocess.PIPE,
>>>>>                     stderr=subprocess.STDOUT)
>>>>> validate = p.communicate(unicode(s,encoding='utf8'))
>>>>> print validate
>>>>> ---------------
>>>>> I get lots of lines like this:
>>>>> Error at line 1, character 66:\tillegal character number 0
>>>>> etc etc.
>>>>> But I can give the command in a terminal 'cat mytest.html | validate' and
>>>>> get reasonable output. My subprocess code must be wrong, but I could use
>>>>> some help to see what the problem is.
>>>>> python2.5.1, freebsd6
>>>>> thanks,
>>>>> --Tim
>>>> ============================
>>>> If you search through the recent Python-List for UTF-8 things you might
>>>> get the same understanding I have come to.
>>>> The problem is the use of python's 'print' subcommand or whatever it
>>>> is. It 'cooks' things and someone decided that it would only handle 1/2
>>>> of a byte (in the x'00 to x'7f' range) and ignore or send error messages
>>>> against anything else. I guess the person doing the deciding read the
>>>> part that says ASCII printables are in the 7 bit range and chose to
>>>> ignore the part about the rest of the byte being undefined. That is
>>>> undefined, not disallowed.  Means the high bit half can be used as
>>>> wanted since it isn't already taken. Nor did whoever it was take a look
>>>> around the computer world and realize the conflict that was going to be
>>>> generated by using only 1/2 of a byte in a 1byte+ world.
>>>> If you can modify your code to use read and write you can bypass print
>>>> and be OK.  Or just have python do the 'cat mytest.html | validate' for
>>>> you. (Apply a var for html and let python accomplish the equivalent
>>>> of Unix's:
>>>>     for f in *.html; do cat $f | validate; done
>>>>                          or
>>>>      for f in *.html; do validate $f; done  #file name available this way
>>>> If you still have problems, take a look at os.popen2 (and its popen3).
>>>> Also take a look at os.spawn* et al.
>>> Wow.  Unicode and subprocessing and printing can have dark corners,
>>> but common sense does apply in MOST situations.
>>> If you send the header, add the newline.
>>> But you do not need the header if you can cat the input file sans
>>> header and get sensible input.
>> Yep!  The problem is with 'print'
>>
> 
> Huh?  Print is printing exactly what you expect it to print.
> 
===============
My apologies.

Tim: going by what you posted, is the third character of the first line
read from the file a TAB?

Just curious.  len(HEADER) is 63, yet the error is reported at character
66 as character number 0. That doesn't quite add up.
63 + CR + LF gives 65.  But, as another poster noted, you don't have those.
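
For what it's worth, the arithmetic can be checked in a few lines. This is
only a sketch: the variable names are mine, and it assumes validate counts
characters from 1 across the whole input, which I have not confirmed.

```python
# HEADER is copied from Tim's post; everything else is illustrative.
HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'

header_len = len(HEADER)          # length of the doctype line, no newline
with_crlf = header_len + 2        # what it would be with CR + LF appended
error_col = 66                    # position reported by validate

# Offset into the file's own content (1-based) under each assumption.
offset_no_newline = error_col - header_len
offset_after_crlf = error_col - with_crlf

print(header_len, with_crlf, offset_no_newline, offset_after_crlf)
```

So the reported position lands on the third character of the file if no
newline was sent, or on the first if a CR + LF had been.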
In "...66:\tillegal...", is the '\t' a tab on screen, or is it byte 1 or 3
of the file?
If you have mc available, highlight the file in it and press Shift-F3, then
F4 for the hex view.  09 is TAB.
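
If mc is not handy, roughly the same check can be done from Python. A
minimal sketch with hypothetical helper names (read_head, hex_bytes,
suspicious_offsets). It only looks at raw bytes and assumes nothing about
validate itself; the sample bytes are made up for illustration.

```python
def read_head(path, count=16):
    """Return the first `count` raw bytes of the file at `path`."""
    with open(path, 'rb') as fh:
        return fh.read(count)


def hex_bytes(data):
    """Format raw bytes as space-separated two-digit hex values."""
    return ' '.join('%02x' % b for b in bytearray(data))


def suspicious_offsets(data, wanted=(0x00, 0x09)):
    """Return (1-based offset, byte value) pairs for NUL and TAB bytes."""
    return [(i + 1, b) for i, b in enumerate(bytearray(data)) if b in wanted]


# Made-up sample: a TAB where a letter should be, and a trailing NUL.
sample = b'<h\tml>\x00'
print(hex_bytes(sample))
print(suspicious_offsets(sample))
```

On the real file it would be something like
print(hex_bytes(read_head('mytest.html'))).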

</title> is a closing tag and should not appear as the opener.
<html>   can be the opener; did the h somehow become a '\'?
          (Still, that would put x'09' at byte 2 of the file.)

Most validate programs I have used tell me when the header is missing
and give me a choice of how to process the file (XML, XHTML, HTML 1.1, ...)
or quit.

Is HEADER ('<!DOC...>') itself already in UTF-8?
Or are you mixing encodings?
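
One way to keep header and file in the same encoding is to decode the file
once, join the text, and encode the whole thing once before it goes down
the pipe. A sketch only, written for a current Python 3 rather than the 2.5
in the original post; build_payload and run_validator are names I made up,
and the sample HTML is invented. The UTF-16 line just shows one way NUL
bytes can appear in otherwise plain text. I am not claiming that is what
happened here.

```python
import subprocess

# HEADER is copied from Tim's post.
HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'


def build_payload(header, raw_html):
    """Decode UTF-8 file bytes, prepend header and newline, re-encode."""
    text = raw_html.decode('utf-8')
    return (header + '\n' + text).encode('utf-8')


def run_validator(command, payload):
    """Pipe payload bytes to command's stdin; return its combined output."""
    proc = subprocess.Popen(command,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output, _ = proc.communicate(payload)
    return output


# Invented sample standing in for the bytes of mytest.html.
raw = u'<html><title>caf\u00e9</title></html>'.encode('utf-8')
payload = build_payload(HEADER, raw)

has_nul_utf8 = b'\x00' in payload                    # plain UTF-8 text
has_nul_utf16 = b'\x00' in HEADER.encode('utf-16-le')  # ASCII as UTF-16
print(has_nul_utf8, has_nul_utf16)
```

With the real program on the path, the call would be
run_validator(['validate'], payload).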

Last but not least: if you have the source of the validate program, check
it over carefully.  The numbers don't work for me.

Just thinking on paper. No need to respond.

Steve


