Trimming X/HTML files

Thomas SMETS duvelbier-tsmets at yahoo.com
Sun Jul 31 11:06:59 EDT 2005


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


The regular expression to remove scripts from an HTML/XHTML file is simple
enough, but it raises a major performance issue.

The following regular expression:
	r'(<script(\s*\S+\s*)+</script>)'
takes ages to complete in Python, even on a simple HTML file: more than
3 minutes of CPU time on a 150-line HTML file. In Jython it never completes
at all but raises a painful RunTimeException: maximum number of ??? reached.
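
The slowdown looks like a case of catastrophic backtracking: inside
(\s*\S+\s*)+ the same run of text can be divided between \s*, \S+ and the
outer + in a very large number of ways, and the engine may try many of them
before it gives up or finds </script>. Below is a minimal sketch of a
non-greedy alternative. The names SCRIPT_RE and strip_scripts and the sample
markup are made up for illustration, and the pattern is assumed to be good
enough only for reasonably well-formed pages.

```python
import re

# Non-greedy body; DOTALL lets '.' cross newlines, IGNORECASE covers <SCRIPT>.
# There is no nested quantifier, so each body character is matched one way only.
SCRIPT_RE = re.compile(r'<script\b.*?</script\s*>', re.DOTALL | re.IGNORECASE)

def strip_scripts(html):
    """Return html with every <script ...>...</script> block removed."""
    return SCRIPT_RE.sub('', html)

# Hypothetical sample input, not taken from the original file.
sample = '<p>a</p><script type="text/javascript">\nvar x = 1;\n</script><p>b</p>'
cleaned = strip_scripts(sample)
```

Non-greedy quantifiers and these flags have been part of the re module for a
long time, so this should not depend on 2.3-only features, though that has
not been checked against Jython here.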

Is the only way out to work with plain string methods and "match" instead
of a regular expression?
Moreover, Jython is not yet 2.3 compliant, hence the advanced regular
expression features of 2.3 are not yet available!
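
For comparison, here is a rough sketch of what the plain-string route could
look like, using only find() and slicing. The function name
strip_scripts_plain is hypothetical, and the code assumes mostly ASCII markup
so that lower() does not change the length of the text.

```python
def strip_scripts_plain(html):
    """Remove <script ...>...</script> blocks using only string searches."""
    # Search in a lowercased copy, slice the original.
    # Assumption: lower() keeps the length unchanged (true for ASCII markup).
    lowered = html.lower()
    open_tag, close_tag = '<script', '</script>'
    pieces = []
    pos = 0
    while True:
        start = lowered.find(open_tag, pos)
        if start == -1:
            pieces.append(html[pos:])    # no more scripts: keep the tail
            break
        pieces.append(html[pos:start])   # keep the text before the script
        end = lowered.find(close_tag, start)
        if end == -1:
            break                        # unterminated script: drop the rest
        pos = end + len(close_tag)
    return ''.join(pieces)
```

Each character is scanned a bounded number of times, so the running time
should grow roughly linearly with the size of the file. One known limitation
of the sketch: it treats any tag whose name merely starts with "script" as a
script element.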

\T,




Thomas SMETS wrote:
|
| Dear,
|
| I need to process XHTML/HTML files in several ways:
|   - Removing comments and JavaScript is the first issue
|   - Retrieving the list of fields to submit is my next item (todo)
|
| Any idea where I could find this already made ... ?
|
| \T,
|
|
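
For the second item in the quoted message (listing the fields to submit), a
parser-based sketch may be more robust than regular expressions. The example
below uses the standard-library HTMLParser as found in current Python
(html.parser). The class name FieldCollector and the sample form are
assumptions made for illustration only.

```python
from html.parser import HTMLParser

class FieldCollector(HTMLParser):
    """Collect (tag, name) pairs for named form controls."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag in ('input', 'select', 'textarea'):
            name = dict(attrs).get('name')
            if name is not None:         # unnamed controls are skipped
                self.fields.append((tag, name))

# Hypothetical sample form, not taken from the original files.
collector = FieldCollector()
collector.feed('<form><input type="text" name="user"/>'
               '<select name="lang"></select>'
               '<textarea name="msg"></textarea>'
               '<input type="submit"></form>')
fields = collector.fields
```
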

- --
Thomas SMETS
Bruxelles
@ : duvelbier-tsmets at yahoo.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFC7OkTqN0SJr+xLBURAuTYAKDLxLv+hpnSrZ6uowOmUczVxgxLqwCYhfJ3
fwjPZzg88gh3lNY8jkG3SA==
=urIC
-----END PGP SIGNATURE-----



More information about the Python-list mailing list