urllib.request giving unexpected results

Wed Nov 16 03:09:59 EST 2016

I'm trying to download a file using urllib.request and pipe it straight to an 
external process. The following test script demonstrates the problem on Linux 
systems:

--- cut ---

#!/usr/bin/python3.5

import urllib.request
import subprocess

TEST_URL = 'https://www.irs.gov/pub/irs-prior/f1040--1864.pdf'

# Case 1: hand the response object directly to the child as its stdin.
with urllib.request.urlopen(TEST_URL) as f:
    data = subprocess.check_output(['file', '-'], stdin=f)
    print(data)

# Case 2: save the body to disk first, then hand the saved file to the child.
with urllib.request.urlopen(TEST_URL) as f:
    with open('/tmp/x.pdf', 'wb') as g:
        g.write(f.read())
    with open('/tmp/x.pdf', 'rb') as g:
        data = subprocess.check_output(['file', '-'], stdin=g)
    print(data)

--- cut ---

Output is:

b'/dev/stdin: data\n'
b'/dev/stdin: PDF document, version 1.6\n'

Expected output is:

b'/dev/stdin: PDF document, version 1.6\n'
b'/dev/stdin: PDF document, version 1.6\n'

If I just read from the object returned by urllib.request.urlopen, I get what 
appears to the naked eye to be the expected data:

py> with urllib.request.urlopen(TEST_URL) as f:
...     file = f.read()
... 
py> print(file[:100])
b'%PDF-1.6\r%\xe2\xe3\xcf\xd3\r\n55 0 obj\r<</Linearized 1/L 66721/O 57/E 28286/N 4/T 65574/H [ 856 317]>>\rendobj\r    '

Certainly looks like a PDF file. So what's going on?
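
For comparison, here is a rough sketch of a third variant that reads the body 
in Python and hands the bytes to the child through a pipe, instead of passing 
the response object as stdin. The working assumption (not verified in this 
post) is that subprocess only uses the fileno() of whatever it is given as 
stdin, which for an HTTPS response would be the raw socket rather than the 
decoded body. The helper names pipe_bytes and identify_url are made up for 
illustration.

```python
import subprocess
import urllib.request


def pipe_bytes(cmd, data):
    """Run cmd, feed data to its stdin through a pipe, and return its stdout."""
    # input= makes subprocess create the pipe and write the bytes itself,
    # so the child never reads from any file descriptor owned by the caller.
    return subprocess.check_output(cmd, input=data)


def identify_url(url):
    """Download url completely, then ask file(1) what the body looks like."""
    with urllib.request.urlopen(url) as f:
        body = f.read()  # body as returned by read(), same as the session above
    return pipe_bytes(['file', '-'], body)
```

Calling print(identify_url(TEST_URL)) would make a useful third data point: 
if it reports a PDF document while the first case still reports "data", that 
points at the stdin=f hand-off rather than at the download itself. Note that 
this reads the whole file into memory first, so it is not quite "straight to 
an external process" for large downloads.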

-- 
Steven
299792.458 km/s — not just a good idea, it’s the law!