Filtering web proxy

Mon Apr 17 17:41:13 EDT 2000

>>>>> "Oleg" == Oleg Broytmann <phd at phd.russ.ru> writes:

    Oleg> Hello!  I want a filtering web proxy. I can write one
    Oleg> myself, but if there is a thing already... well, I don't
    Oleg> want to reinvent the wheel. If there is such a thing (free and
    Oleg> open source, 'course), I'll extend it for my needs.

i am using junkbuster. it's simple, and works well.

it only does url blocking (so mostly banner ads), but i never have
javascript turned on anyway. it can block cookies selectively (i have
everything but slashdot blocked), hide or spoof the user-agent, and
chain to other http proxies.
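
a minimal sketch of what url blocking plus a per-site cookie allowlist
could look like in python. the patterns, the names BLOCK_PATTERNS,
ALLOW_COOKIES_FROM, is_blocked and cookies_allowed are all made up for
illustration; this is not junkbuster's blockfile syntax or its code.

```python
import re

# hypothetical blocklist: regexes matched anywhere in the requested url
BLOCK_PATTERNS = [
    r"^https?://ads\.",   # hosts whose name starts with "ads."
    r"/banners?/",        # paths containing /banner/ or /banners/
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in BLOCK_PATTERNS]

# hypothetical allowlist: only these domains may set or receive cookies
ALLOW_COOKIES_FROM = {"slashdot.org"}


def is_blocked(url):
    """Return True if the url matches any blocklist pattern."""
    return any(p.search(url) for p in _COMPILED)


def cookies_allowed(host):
    """Return True if host is an allowed domain or a subdomain of one."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in ALLOW_COOKIES_FROM)
```

a proxy built this way would answer blocked urls with an empty or
placeholder response, and strip Cookie / Set-Cookie headers for any host
where cookies_allowed() is false.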

it doesn't do html parsing, and it's not written in python, though.
but i haven't missed anything, so its being written in c hasn't
bothered me much. with javascript/java turned off there's not much
need for html parsing...

    Oleg>    I wrote a dozen HTML parsers in Python, so I can write one
    Oleg> more, and turn it into a proxy, but maybe I can start with
    Oleg> some already debugged code?

an html parser would need to work incrementally, unless you want to
wait for the whole document to be transferred over the network before
seeing any of it rendered.

i guess you could do it incrementally with sgmllib (iirc you feed() it
the data a chunk at a time), but you run into the fact that a big part
of the html documents on the web are malformed and rely on the
error-correcting heuristics of the major browsers to function...
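
a rough sketch of the incremental approach. it uses html.parser from
the current standard library instead of sgmllib (sgmllib was dropped in
python 3), so it is a stand-in, not the sgmllib code itself. the class
name ScriptStripper and the choice to drop <script> elements are just
assumptions to have something to filter. comments and doctype
declarations are silently dropped here to keep it short.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit markup as it arrives, leaving out <script> elements."""

    def __init__(self):
        # keep entities as written so they can be passed through untouched
        super().__init__(convert_charrefs=False)
        self.out = []            # filtered output pieces, in order
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif not self._in_script:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/>: emit once, no separate end tag
        if not self._in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif not self._in_script:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def filter_chunks(chunks):
    """Feed chunks as they would arrive from the network; return the result."""
    parser = ScriptStripper()
    for chunk in chunks:
        parser.feed(chunk)   # tags split across chunks are buffered internally
    parser.close()
    return "".join(parser.out)
```

in a real proxy you would write each piece to the client as soon as it
is produced instead of collecting it in a list. html.parser does not
raise on bad markup, but how closely its recovery matches what the
browsers do is a separate question.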

one starting point could be the "gray proxy" (i forget what it was
really called). it was written on top of medusa; i think there was
an announcement here, probably a year or so ago. it parsed the html
and changed all the colors to grayscale, and did the same for
images. medusa isn't free though... (except the version in zope?)
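
for the color part, a small sketch of one way to turn #rrggbb values
into gray. this is not taken from that proxy; the function names
to_gray and gray_colors are invented, and the luma weights
(0.299, 0.587, 0.114) are just one common choice.

```python
import re

# matches six-digit hex colors such as #ff0000
_HEX_COLOR = re.compile(r"#([0-9a-fA-F]{6})\b")


def to_gray(hex_color):
    """Convert '#rrggbb' to the gray '#yyyyyy' with similar brightness."""
    r = int(hex_color[1:3], 16)
    g = int(hex_color[3:5], 16)
    b = int(hex_color[5:7], 16)
    y = round(0.299 * r + 0.587 * g + 0.114 * b)   # weighted luma
    return "#%02x%02x%02x" % (y, y, y)


def gray_colors(text):
    """Replace every six-digit hex color in text with its gray version."""
    return _HEX_COLOR.sub(lambda m: to_gray(m.group(0)), text)
```

named colors ("red") and three-digit hex values are not handled here.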

mozilla allows you to block/allow javascript by domain, iirc.

  -- erno