a website information gathering script

Sat Mar 23 15:33:52 EST 2002

On Fri, 22 Mar 2002 07:39:06 -0600, Scott Hathaway <scott.l.hathaway at lmco.com> wrote:

>I have a simple python script which gathers information about a website
>and produces an html report about what it finds.  The way it works is
>very clunky and I would appreciate some feedback and help improving it.
>
>http://www.hcsprogramming.com/downloads/index.html
>
Unless you know that the format of your files is more restricted than their
types require, your criterion for finding a file reference is going to
get spurious hits (of course the other side is you can't statically find
strings that don't exist until run time either, when it comes to program sources).

E.g., for HTML you could do a much better job by using the HTMLParser module.
UIAM it also normalizes things to lower case for you, besides letting you
skip commented-out references and similar character sequences in
the data part. And since the parser makes it easy to get hrefs, you could
split and check #-references and definitions too.

The other files should really be parsed too, if you want to avoid spurious
matches.

As Emile pointed out, concatenating strings with '+' is relatively expensive,
and one way of doing it more cheaply is to put the pieces in a list and later
join them. You can do that in a style that looks similar to what you did, but
is a lot more efficient.

Note that strings separated by white space are concatenated by the compiler,
and in a () or [] context that white space can include EOLs, so you can mix
string fragments and expressions e.g., like this:

 >>> x= '<example>'
 >>> a = [
 ...     '123'
 ...     '456'
 ...     '\n', x*2 , '\n'
 ...     'abc'
 ...     'def'
 ... ]
 >>> a
 ['123456\n', '<example><example>', '\nabcdef']

Notice that 'a' wound up a list of _3_ strings here,
and note the commas isolating the x*2 expression.

 >>> print ''.join(a)
 123456
 <example><example>
 abcdef
 >>> a += ['\nYou can add this way too, for ', x[1:-1], '.']
 >>> print ''.join(a)
 123456
 <example><example>
 abcdef
 You can add this way too, for example.
 >>> a += [
 ... '\nOr'
 ...   ' this'
 ...     ' way'
 ...       '\n(noting that indent is ignored inside () or [].' ]
 >>> print ''.join(a)
 123456
 <example><example>
 abcdef
 You can add this way too, for example.
 Or this way
 (noting that indent is ignored inside () or [].
 >>>

Of course, you can arrange pieces to suit, but the point
is that it needn't be expensive to write a mix of string fragments
and expressions in a sequence, and you can lay it out in
multiple lines of source without a += 'next string piece' expense
at every line (noting that the similar looking
    a += ['a string ',foo(),' more']
is comparatively cheap).

Triple quotes of course are good for defining lines without using '\n',
and you can plug in pieces symbolically from a directory:

 >>> b = """<h2>%(bigText)s</h2>
 ... <font color="red">%(redText)s</font>
 ... etc...
 ... """
 >>> stuff = {'bigText':'Should be big', 'redText':'Should be red'}
 >>> b % stuff
 '<h2>Should be big</h2>\n<font color="red">Should be red</font>\netc...\n'
 >>> print b % stuff
 <h2>Should be big</h2>
 <font color="red">Should be red</font>
 etc...

You can also use vars() in place of stuff to pick up symbols
from the local or other dictionaries.

You might also want to consider using dictionaries for some
sets of things instead of lists.

To enhance the directory walker, you could give it globbing
and excluding and optional recursion controls. Perhaps make
it a class meant to be subclassed like HTMLParse, feeding it
glob strings and having it call convenient methods.

BTW, I'd try to avoid useless re-computation, as in putting
".tolower()" on stuff that's already guaranteed lower case
by previous code. And  file = str(file) looks like a no-op
in its context. Also, if you have a series of mutually
exclusive if conditions, making the if's after the first
into elif's will avoid executing tests after one succeeds.

BTW2, all those .endswith() tests would succeed on file names
without extensions but having the given suffix. You could fix
it, but Emile's version avoids that by splitting on '.' so you
gain two ways ;-)

That enough for this go ;-)

Regards,
Bengt Richter