piping input to an external script

Tue May 12 02:50:23 EDT 2009

On May 11, 11:31 pm, norseman <norse... at hughes.net> wrote:
> Steve Howell wrote:
> > On May 11, 10:16 pm, norseman <norse... at hughes.net> wrote:
> >> Tim Arnold wrote:
> >>> Hi, I have some html files that I want to validate by using an external
> >>> script 'validate'. The html files need a doctype header attached before
> >>> validation. The files are in utf8 encoding. My code:
> >>> ---------------
> >>> import os,sys
> >>> import codecs,subprocess
> >>> HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
> >>> filename  = 'mytest.html'
> >>> fd = codecs.open(filename,'rb',encoding='utf8')
> >>> s = HEADER + fd.read()
> >>> fd.close()
> >>> p = subprocess.Popen(['validate'],
> >>>                     stdin=subprocess.PIPE,
> >>>                     stdout=subprocess.PIPE,
> >>>                     stderr=subprocess.STDOUT)
> >>> validate = p.communicate(unicode(s,encoding='utf8'))
> >>> print validate
> >>> ---------------
> >>> I get lots of lines like this:
> >>> Error at line 1, character 66:\tillegal character number 0
> >>> etc etc.
> >>> But I can give the command in a terminal 'cat mytest.html | validate' and
> >>> get reasonable output. My subprocess code must be wrong, but I could use
> >>> some help to see what the problem is.
> >>> python2.5.1, freebsd6
> >>> thanks,
> >>> --Tim
> >> ============================
> >> If you search through the recent Python-List for UTF-8 things you might
> >> get the same understanding I have come to.
>
> > >> The problem is the use of Python's 'print' statement, or whatever it
> >> is. It 'cooks' things and someone decided that it would only handle 1/2
> >> of a byte (in the x'00 to x'7f' range) and ignore or send error messages
> >> against anything else. I guess the person doing the deciding read the
> >> part that says ASCII printables are in the 7 bit range and chose to
> >> ignore the part about the rest of the byte being undefined. That is
> >> undefined, not disallowed.  Means the high bit half can be used as
> >> wanted since it isn't already taken. Nor did whoever it was take a look
> >> around the computer world and realize the conflict that was going to be
> >> generated by using only 1/2 of a byte in a 1byte+ world.
>
> >> If you can modify your code to use read and write you can bypass print
> >> and be OK.  Or just have python do the 'cat mytest.html | validate' for
> > >> you. Apply a var for html and let Python accomplish the equivalent
> >> of Unix's:
> >>     for f in *.html; do cat $f | validate; done
> >>                          or
> >>      for f in *.html; do validate $f; done  #file name available this way
>
> > >> If you still have problems, take a look at os.popen2 (and os.popen3).
> > >> Also take a look at the os.spawn* functions.
>
> > Wow.  Unicode and subprocessing and printing can have dark corners,
> > but common sense does apply in MOST situations.
>
> > If you send the header, add the newline.
>
> > But you do not need the header if you can cat the input file sans
> > header and get sensible output.
>
> Yep!  The problem is with 'print'
>

Huh?  Print is printing exactly what you expect it to print.
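
For what it's worth, here is a minimal sketch of the "header plus newline, then pipe the bytes" approach suggested above. It is written in Python 3 syntax rather than the 2.5 used in the original post, so treat it as an illustration and not a drop-in fix. The names build_payload, pipe_to and ECHO are made up for this example, and ECHO is only a stand-in for the real 'validate' script, whose behavior I am not assuming anything about.

```python
import subprocess
import sys

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'


def build_payload(body_text, header=HEADER):
    """Prepend the doctype and a newline, then encode once as UTF-8 bytes."""
    return (header + "\n" + body_text).encode("utf-8")


def pipe_to(cmd, payload):
    """Write bytes to cmd's stdin; return its combined stdout/stderr as bytes."""
    p = subprocess.Popen(cmd,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    out, _ = p.communicate(payload)
    return out


# Hypothetical stand-in for 'validate': a child process that copies
# stdin to stdout unchanged, so the sketch runs without the real script.
ECHO = [sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]

if __name__ == "__main__":
    out = pipe_to(ECHO, build_payload("<html><body>caf\u00e9</body></html>"))
    print(out)  # shows the raw bytes the child process received
```

With the real script you would pass ['validate'] instead of ECHO. The point is that the text is encoded to UTF-8 exactly once and the pipe only ever sees bytes; print is not involved in what the child process receives.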