Filtering web proxy

Amit Patel amitp at Xenon.Stanford.EDU
Mon Apr 17 22:49:54 EDT 2000


 Neil Schemenauer <nascheme at enme.ucalgary.ca> wrote:
| 
| Yes and if your connection is fast enough that you don't need
| incremental loading you probably don't care too much about ads.
| In my experience, filtering ads greatly enhances your experience
| if you're browsing on a slow connection.

Even with a fast connection, there are still things like filtering out
pop-ups, blocking cookies from certain sites, forcing pages to be
cacheable (by modifying Cache-Control headers .. evil evil!) that can
be useful.  One really handy thing I wrote was to highlight keywords
you searched for, so for example, you search for "big dog" and visit a
page, and it'll highlight "big" and "dog" on that page.  However,
Google just added this feature (when you use its cached link) so I
don't need it so much in a proxy.
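
A rough sketch of the keyword-highlighting idea, not the code I actually
ran in the proxy.  The function name highlight, the <b> markup, and the
simple tag-splitting regexp are all assumptions made for illustration;
the regexp will be confused by a ">" inside an attribute value or a
comment.

```python
import re

def highlight(html, query):
    """Wrap each word of the search query in <b>...</b>, skipping tags."""
    words = [re.escape(w) for w in query.split()]
    if not words:
        return html
    word_re = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    # Splitting on a capturing group keeps the tags in the result:
    # even indexes hold text, odd indexes hold tags.
    pieces = re.split(r"(<[^>]*>)", html)
    for i in range(0, len(pieces), 2):
        pieces[i] = word_re.sub(r"<b>\1</b>", pieces[i])
    return "".join(pieces)
```

Only the text between tags is rewritten, so a "dog" that appears inside
an href is left alone.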

| Is the situation with XML the same as HTML?  Are XML documents
| forced to adhere to the standard or are parsers supposed to try
| to do something intelligent with whatever crap they get fed?

I believe XML and XHTML parsers are required to reject documents that
are not well-formed, rather than guess at what was meant.
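
To illustrate the difference with today's Python standard library (a
sketch only; the names is_well_formed and TagCollector are made up for
this example): the XML parser refuses input that is not well-formed,
while the HTML parser reports whatever tags it can find.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

class TagCollector(HTMLParser):
    """Lenient HTML parsing: record start tags, never complain."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def lenient_tags(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags
```

Feeding "<p><b>unclosed</p>" to is_well_formed gives False, while
lenient_tags on the same string still reports the p and b tags.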

| Saying that it parses HTML is a bit of a stretch however.  It
| just uses a couple of regexes.  I'm sure Tim Peters would love
| it. :)

Regexps can "parse" HTML tags.  There was something called REX.py that
was posted here somewhere.  It's based on REX for Perl, which is based
on Robert D. Cameron, "REX: XML Shallow Parsing with Regular
Expressions", Technical Report TR 1998-17, School of Computing
Science, Simon Fraser University, November 1998.

The idea is that REX is a humongous hairy regexp (sorry Timbot!) that
will match one tag or non-tag at a time.  You just keep feeding it
data and it can keep giving you tags/non-tags.  With that, I hope to
build a filtering proxy that tokenizes all the HTML and does evil
transformations to it, incrementally.
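
The feeding loop might look roughly like this.  It is a sketch under
stated assumptions: TOKEN is a much simplified stand-in for REX (it
does not handle comments, CDATA, or ">" inside attribute values), and
the class name IncrementalTokenizer and its feed/close methods are
invented for the example.

```python
import re

# Simplified stand-in for REX: a run of text, or a tag up to its ">".
# The trailing ">?" lets an unterminated tag still match.
TOKEN = re.compile(r"[^<]+|<[^>]*>?")

class IncrementalTokenizer:
    def __init__(self):
        self.buffer = ""

    def feed(self, data):
        """Add data and return the tokens known to be complete."""
        self.buffer += data
        tokens = TOKEN.findall(self.buffer)
        # The last token may continue in the next chunk, so hold it back.
        self.buffer = tokens.pop() if tokens else ""
        return tokens

    def close(self):
        """Return whatever is left once the stream has ended."""
        rest, self.buffer = self.buffer, ""
        return [rest] if rest else []
```

Holding back the final token on every call is the simplest safe choice:
it delays output by one token, but a tag or text run split across two
network reads is never emitted in pieces.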


	 - Amit



P.S.  Here's the regexp, just for Tim Peters:

'[^<]+|<(?:!(?:--(?:[^-]*-(?:[^-][^-]*-)*->?)?|\\[CDATA\\[(?:[^\\]]*](?:[^\\]]+])*]+(?:[^\\]>][^\\]]*](?:[^\\]]+])*]+)*>)?|DOCTYPE(?:[\\n\\t\\r]+(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:[\\n\\t\\r]+(?:(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*|"[^"]*"|\'[^\']*\'))*(?:[\\n\\t\\r]+)?(?:\\[(?:<(?:!(?:--[^-]*-(?:[^-][^-]*-)*->|[^-](?:[^\\]"\'><]+|"[^"]*"|\'[^\']*\')*>)|\\?(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:\\?>|[\\n\\r\\t][^?]*\\?+(?:[^>?][^?]*\\?+)*>))|%(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*;|[\\n\\t\\r]+)*](?:[\\n\\t\\r]+)?)?>?)?)?|\\?(?:(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:\\?>|[\\n\\r\\t][^?]*\\?+(?:[^>?][^?]*\\?+)*>)?)?|/(?:(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:[\\n\\t\\r]+)?>?)?|(?:(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:[\\n\\t\\r]+(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*(?:[\\n\\t\\r]+)?=?(?:[ \\n\\t\\r]+)?(?:"[^<"]*"|\'[^<\']*\'|\\w+))*(?:[ \\n\\t\\r]+)?/?>?)?)'

--
Amit J Patel, Computer Science Department, Stanford University
http://www-cs-students.stanford.edu/~amitp/




