Readability (html purifier) in Python

Дамјан Георгиевски gdamjan at gmail.com
Wed Jun 16 15:51:21 EDT 2010


>> http://lab.arc90.com/experiments/readability/
>>
>> Readability is a javascript bookmarklet that "makes reading on the
>> Web more enjoyable by removing the clutter around what you're
>> reading."
>>
>> Does anyone know of something similar in Python?
> 
> Well, that sounds like a browser tool.

Yes, it's a bookmarklet: a small piece of JavaScript that, when clicked,
runs on the current document in the browser.

> Could you be a bit more specific about what kind of "similar" 
> functionality you would expect from a "similar" Python tool? 
> How would you tell it "what you're reading", for example?

I'm not sure I understand your question correctly, but anyway.

What I need is a package that, given an arbitrary HTML document (a page
from any website), would extract the meaningful content and filter out
the junk (advertisements, non-content elements, any other UI, etc.).


Readability seems to do some heuristic manipulation of the DOM, but
I'm not that good at reading/understanding its source code. Of course
it can't be 100% correct, but it's good enough in many cases.

http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js
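
To make the request more concrete, here is a rough sketch of the kind of
heuristic I have in mind, using only the standard library. It is not how
Readability itself works; it is a much cruder stand-in that scores each
container element by how much plain text it holds versus link text, and
returns the text of the best one. The names BlockScorer and
extract_main_text, and the tag lists, are made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never article content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: these containers are the candidate "content blocks".
BLOCK_TAGS = {"div", "article", "section", "td", "body"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class BlockScorer(HTMLParser):
    """Collects text per innermost block and counts plain vs. linked chars."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, innermost last
        self.open_blocks = []  # candidate blocks that are still open
        self.blocks = []       # candidate blocks that have been closed

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in BLOCK_TAGS:
            self.open_blocks.append({"text": [], "plain": 0, "linked": 0})

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag, ignore it
        # Pop up to the matching tag, implicitly closing unclosed children.
        while self.stack:
            closed = self.stack.pop()
            if closed in BLOCK_TAGS:
                self.blocks.append(self.open_blocks.pop())
            if closed == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self.open_blocks:
            return
        if any(t in SKIP_TAGS for t in self.stack):
            return
        block = self.open_blocks[-1]  # credit the innermost block only
        block["text"].append(text)
        if "a" in self.stack:
            block["linked"] += len(text)
        else:
            block["plain"] += len(text)


def extract_main_text(html):
    """Return the text of the block with the most non-link text."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    # Include blocks the document never closed.
    candidates = parser.blocks + parser.open_blocks
    if not candidates:
        return ""
    best = max(candidates, key=lambda b: b["plain"] - b["linked"])
    return " ".join(best["text"])
```

Calling extract_main_text(html) on a page's source returns one string.
Link-heavy blocks such as menus and ad boxes score low and lose to the
block holding the article text. A real tool would also need to merge
sibling blocks and look at class/id names, which this sketch does not do.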



-- 
дамјан ((( http://damjan.softver.org.mk/ )))

war is peace
freedom is slavery
restrictions are enablement



