A better webpage filter

Anton Vredegoor anton.vredegoor at gmail.com
Sat Mar 24 14:45:41 EDT 2007


For a few days now I've been experimenting with a construct that enables 
me to send the source code of the web page I'm reading through a Python 
script and then into a new tab in Mozilla. The new tab is opened 
automatically, so the process feels very natural, although there's a lot 
of reading, filtering and writing behind the scenes.

I want to do three things with this post:

A) Explain the process so that people can try it for themselves and say 
"Hey stupid, I've been doing the same thing with greasemonkey for ages", 
or maybe "You're great, this is easy to see, since the crux of the 
biscuit is the apostrophe."  Both kinds of comments are very welcome.

B) Explain why I want such a thing.

C) If this approach still holds up after all of the above, ask for help 
writing a better Python htmlfilter.py

So here we go:

A) Explain the process

We need:

- Mozilla Firefox http://en-us.www.mozilla.com/en-US/
- add-on viewsourcewith https://addons.mozilla.org/firefox/394/
- batch file (on windows):
(htmfilter.bat)
d:\python25\python.exe D:\Python25\Scripts\htmlfilter.py "%1" > out.html
start out.html
- a python script:
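#htmlfilter.py

import sys

def htmlfilter(fname, skip=()):
     # Read the whole page source into memory.
     f = open(fname)
     try:
         data = f.read()
     finally:
         f.close()
     # Collect the (start, end) index of every '<...>' tag.
     L = []
     j = None
     for i, x in enumerate(data):
         if x == '<':
             j = i
         elif x == '>' and j is not None:
             L.append((j, i))
             j = None
     # Work backwards so earlier indices stay valid while R shrinks.
     R = list(data)
     for i, j in reversed(L):
         s = data[i:j+1]
         for x in skip:
             if x in s:
                 # Replace the whole tag with a single space.
                 R[i:j+1] = ' '
                 break
     return ''.join(R)

def test():
     if len(sys.argv) == 2:
         skip = ['div', 'table']
         fname = sys.argv[1].strip()
         print(htmlfilter(fname, skip))

if __name__ == '__main__':
     test()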

Now install the htmlfilter.py file in your Python scripts dir and adapt 
the batch file to point to it.

To run the batch file through the viewsourcewith add-on: go to some 
webpage, left click and view the source with the batch file.

B) Explain why I want such a thing.

OK, maybe this should have been the thing to start with, but hey, it's 
such an interesting technique it's almost a waste not to give it a chance 
before my idea is dissed :-)

Most web pages I visit lately are taking so much room for ads (even with 
adblocker installed) that the mere 20 columns of text that are available 
for reading are slowing me down unacceptably. I have tried clicking 
'print this' or 'printer friendly' or using 'no style' from the mozilla 
menu and switching back again for other pages but it was tedious to say 
the least. Every webpage has different conventions. In the end I just 
started editing web pages' source code by hand, cutting out the beef and 
saving it as an html file with only text, no scripts or formatting. But 
that was also not very satisfying because raw web pages are *big*.

Then I found out that I could often just replace all 'table' or 'div' 
tags with a space and the page, although not very html compliant any 
more, still loads, and often the text looks a lot better. This worked for 
at least 50 percent of the pages and restored my autonomy and 
independence in reading web pages! (Which I do a lot, by the way. Maybe 
for most people the problem is not very irritating because they don't 
read as much? Tell me that too, I want to know :-)
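
To make the replacement concrete, here is a minimal sketch of the same 
idea applied to a made-up snippet held in a string. It is not the script 
from section A: it swaps in a regular expression, and the name strip_tags 
and the sample markup are only assumptions for illustration. Unlike the 
substring test in htmlfilter.py, this pattern only matches tags whose 
name is exactly 'div' or 'table'.

```python
import re

def strip_tags(html, names=('div', 'table')):
    # Hypothetical helper: replace every opening or closing tag whose
    # name is one of `names` with a single space, leaving the text and
    # all other tags in place.
    alternatives = '|'.join(re.escape(n) for n in names)
    pattern = re.compile(r'</?(?:%s)\b[^>]*>' % alternatives, re.IGNORECASE)
    return pattern.sub(' ', html)

# Made-up sample markup, only for illustration.
sample = '<div class="ad"><table><tr><td>Some text</td></tr></table></div>'
print(repr(strip_tags(sample)))
```

The row and cell tags are left alone here, which matches the "not very 
html compliant any more" result described above.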

C) Ask for help writing a better Python htmlfilter.py

Please. You can see the code for yourself, this must be doable in a 
better way :-)

A.


