HTML Content Rewriting

Doug Fort dougfort at dougfort.net
Wed Jun 27 09:53:37 EDT 2001


Content-Transfer-Encoding: 8Bit



Merton Campbell Crockett wrote:



> Several years ago, I developed a system for a customer that allowed their

> employees and customers to securely access web content on servers inside

> their firewall.  Basically, I used Apache's mod_rewrite module to

> implement what might be called a "dual reverse proxy".

> 

> Unfortunately, times have changed.  Several of the customer's

> organizations have started playing with various web development tools that

> create dynamic

> content.  Several of these embed information from the HTTP requests in the

> documents that are generated.

> 

> At a minimum this embedded information results in warnings about protocol

> changes, i.e. hard-coded links that specify an http: method when the

> remote

> users are using the https: method.  At worse, there are references to

> internal names and IP addresses that are not accessible from the Internet.

> 

> Both PHP and Python seem to provide capabilities that would allow "fix

> ups"

> to be applied to the content as it is delivered to the remote user. 

> Python looks like it might have a few more tools for manipulating HTML

> content.

> 

> What I would like to do is dynamically add a BASE tag to the document and

> convert all absolute to relative references if they involve the current

> web

> site.  For references to other web servers accessible through this

> facility, I would like to ensure that the references are in the external

> form and to disable the links to web servers that are not accessible by

> remote users.

> 

> What I would like from this group is some guidance.  Can this be done with

> Python?  Are there existing Python tools that might perform some of the

> functions that I would like performed?  What pitfalls and "gotchas" should

> one be aware?

> 

> Merton Campbell Crockett

> 

> 

> 



We have a similar situation in our website load testing application 

http://www.stressmy.com.  We want to parse each page as the user accesses 

it and change the links to point to our app so we can build a test case.  

I've attached our parser which may be of some use to you.



This does not address the affliction of javascript.  That's a whole 

different problem, but there's dozens of ways to create links dynamically.  

We handle them in a different way.  I'll post it if you're interested.

-- 

Doug Fort <dougfort at dougfort.net>

http://www.dougfort.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: filteringparser.py
Type: text/x-java
Size: 40067 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20010627/2f8a3b1c/attachment.java>


More information about the Python-list mailing list