HTML Content Rewriting
Doug Fort
dougfort at dougfort.net
Wed Jun 27 09:53:37 EDT 2001
Merton Campbell Crockett wrote:
> Several years ago, I developed a system for a customer that allowed their
> employees and customers to securely access web content on servers inside
> their firewall. Basically, I used Apache's mod_rewrite module to
> implement what might be called a "dual reverse proxy".
>
> Unfortunately, times have changed. Several of the customer's
> organizations have started playing with various web development tools that
> create dynamic content. Several of these embed information from the HTTP
> requests in the documents that are generated.
>
> At a minimum this embedded information results in warnings about protocol
> changes, i.e. hard-coded links that specify an http: method when the remote
> users are using the https: method. At worst, there are references to
> internal names and IP addresses that are not accessible from the Internet.
>
> Both PHP and Python seem to provide capabilities that would allow "fix ups"
> to be applied to the content as it is delivered to the remote user.
> Python looks like it might have a few more tools for manipulating HTML
> content.
>
> What I would like to do is dynamically add a BASE tag to the document and
> convert all absolute to relative references if they involve the current web
> site. For references to other web servers accessible through this
> facility, I would like to ensure that the references are in the external
> form and to disable the links to web servers that are not accessible by
> remote users.
>
> What I would like from this group is some guidance. Can this be done with
> Python? Are there existing Python tools that might perform some of the
> functions that I would like performed? What pitfalls and "gotchas" should
> one be aware of?
>
> Merton Campbell Crockett
>
We have a similar situation in our website load testing application,
http://www.stressmy.com. We want to parse each page as the user accesses
it and change the links to point to our app so we can build a test case.
I've attached our parser, which may be of some use to you.
This does not address the affliction of JavaScript. That's a whole
different problem, since there are dozens of ways to create links
dynamically. We handle those in a different way. I'll post our approach if
you're interested.
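The parse-and-rewrite idea described above can be sketched roughly as
follows. This is not the attached filteringparser.py; it is a minimal
illustration that assumes Python 3's standard html.parser and urllib.parse
modules. The host names in INTERNAL_HOSTS, the URL_ATTRS set, and the
LinkRewriter and rewrite_html names are hypothetical, chosen only for the
example. It re-emits the page and turns absolute links to internal hosts
into site-relative ones, leaving every other link alone.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values for illustration; not taken from the original post.
INTERNAL_HOSTS = {"intranet.example.com", "10.0.0.5"}
URL_ATTRS = {"href", "src", "action"}


class LinkRewriter(HTMLParser):
    """Re-emit HTML, making links to internal hosts site-relative."""

    def __init__(self):
        # Keep character references in text as-is so they can be echoed back.
        super().__init__(convert_charrefs=False)
        self.out = []

    def rewrite(self, url):
        try:
            parts = urlsplit(url)
        except ValueError:
            return url  # leave malformed URLs untouched
        if parts.hostname in INTERNAL_HOSTS:
            # Drop scheme and host so the browser reuses the current ones.
            return urlunsplit(("", "", parts.path or "/",
                               parts.query, parts.fragment))
        return url

    def _tag(self, tag, attrs, close):
        pieces = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                pieces.append(name)
                continue
            if name in URL_ATTRS:
                value = self.rewrite(value)
            # The parser hands over unescaped values, so escape them again.
            pieces.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(pieces) + close

    def handle_starttag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, ">"))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_html(html):
    """Return html with internal absolute links made site-relative."""
    parser = LinkRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, rewrite_html('<img src="http://10.0.0.5/logo.png">') returns
'<img src="/logo.png">', while a link to a host outside INTERNAL_HOSTS
passes through unchanged. Tag and attribute names come back lowercased,
processing instructions are dropped, and links built by scripts are not
touched at all, so treat this as a starting point rather than a complete
filter.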
--
Doug Fort <dougfort at dougfort.net>
http://www.dougfort.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: filteringparser.py
Type: text/x-java
Size: 40067 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20010627/2f8a3b1c/attachment.java>