a website information gathering script

Scott Hathaway scott.l.hathaway at lmco.com
Tue Mar 26 11:45:00 EST 2002


Thanks to both of you for your help!  I will improve the script when I have some time.

Scott

Bengt Richter wrote:

> On Fri, 22 Mar 2002 07:39:06 -0600, Scott Hathaway <scott.l.hathaway at lmco.com> wrote:
>
> >I have a simple python script which gathers information about a website
> >and produces an html report about what it finds.  The way it works is
> >very clunky and I would appreciate some feedback and help improving it.
> >
> >http://www.hcsprogramming.com/downloads/index.html
> >
> Unless you know that the format of your files is more restricted than their
> types require, your criterion for finding a file reference is going to
> get spurious hits (and the other side of that coin is that, for program
> sources, you can't statically find strings that aren't built until run time).
>
> E.g., for HTML you could do a much better job by using the HTMLParser module.
> Unless I'm mistaken, it also normalizes tag and attribute names to lower case
> for you, besides letting you skip commented-out references and similar
> character sequences that only occur in the data part. And since the parser
> makes it easy to get at hrefs, you could split and check #-references and
> their definitions too.
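>
> A minimal sketch of that approach (the class name and the way the
> references are collected are just illustrative) might look like this:
>
>     import HTMLParser
>
>     class LinkCollector(HTMLParser.HTMLParser):
>         """Collect href/src values from start tags, ignoring comments."""
>         def __init__(self):
>             HTMLParser.HTMLParser.__init__(self)
>             self.refs = []
>         def handle_starttag(self, tag, attrs):
>             # tag and attribute names arrive already lower-cased
>             for name, value in attrs:
>                 if name in ('href', 'src'):
>                     # separate any #fragment so it can be checked on its own
>                     self.refs.append(value.split('#', 1))
>
>     parser = LinkCollector()
>     parser.feed(open('index.html').read())
>     parser.close()
>     print parser.refs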
>
> The other files should really be parsed too, if you want to avoid spurious
> matches.
>
> As Emile pointed out, concatenating strings with '+' is relatively expensive,
> and one way of doing it more cheaply is to put the pieces in a list and later
> join them. You can do that in a style that looks similar to what you did, but
> is a lot more efficient.
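>
> Applied to building the report, that pattern might look roughly like
> this (the variable names are made up):
>
>     pages = ['index.html', 'about.html']    # whatever the walker found
>     parts = []
>     for page in pages:
>         parts.append('<tr><td>')
>         parts.append(page)
>         parts.append('</td></tr>\n')
>     report = ''.join(parts)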
>
> Note that adjacent string literals separated only by white space are
> concatenated by the compiler, and in a () or [] context that white space
> can include newlines, so you can mix string fragments and expressions,
> e.g., like this:
>
>  >>> x= '<example>'
>  >>> a = [
>  ...     '123'
>  ...     '456'
>  ...     '\n', x*2 , '\n'
>  ...     'abc'
>  ...     'def'
>  ... ]
>  >>> a
>  ['123456\n', '<example><example>', '\nabcdef']
>
> Notice that 'a' wound up a list of _3_ strings here,
> and note the commas isolating the x*2 expression.
>
>  >>> print ''.join(a)
>  123456
>  <example><example>
>  abcdef
>  >>> a += ['\nYou can add this way too, for ', x[1:-1], '.']
>  >>> print ''.join(a)
>  123456
>  <example><example>
>  abcdef
>  You can add this way too, for example.
>  >>> a += [
>  ... '\nOr'
>  ...   ' this'
>  ...     ' way'
>  ...       '\n(noting that indent is ignored inside () or []).' ]
>  >>> print ''.join(a)
>  123456
>  <example><example>
>  abcdef
>  You can add this way too, for example.
>  Or this way
>  (noting that indent is ignored inside () or []).
>  >>>
>
> Of course, you can arrange pieces to suit, but the point
> is that it needn't be expensive to write a mix of string fragments
> and expressions in a sequence, and you can lay it out in
> multiple lines of source without a += 'next string piece' expense
> at every line (noting that the similar looking
>     a += ['a string ',foo(),' more']
> is comparatively cheap).
>
> Triple quotes of course are good for defining lines without using '\n',
> and you can plug in pieces symbolically from a dictionary:
>
>  >>> b = """<h2>%(bigText)s</h2>
>  ... <font color="red">%(redText)s</font>
>  ... etc...
>  ... """
>  >>> stuff = {'bigText':'Should be big', 'redText':'Should be red'}
>  >>> b % stuff
>  '<h2>Should be big</h2>\n<font color="red">Should be red</font>\netc...\n'
>  >>> print b % stuff
>  <h2>Should be big</h2>
>  <font color="red">Should be red</font>
>  etc...
>
> You can also use vars() in place of stuff to pick up the values of
> local variables (or of another namespace's dictionary).
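>
> For instance, continuing the example above:
>
>  >>> bigText = 'Should be big'
>  >>> redText = 'Should be red'
>  >>> print b % vars()
>  <h2>Should be big</h2>
>  <font color="red">Should be red</font>
>  etc...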
>
> You might also want to consider using dictionaries for some
> sets of things instead of lists.
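>
> For example, if you are checking many file names against the set of
> references already seen, a dictionary lookup is cheaper than scanning
> a list (this is just the idiom, not your actual variable names):
>
>  >>> seen = {}
>  >>> seen['index.html'] = 1
>  >>> seen.has_key('index.html')
>  1
>  >>> seen.has_key('missing.html')
>  0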
>
> To enhance the directory walker, you could give it globbing, exclusion,
> and optional recursion controls. Perhaps make it a class meant to be
> subclassed like HTMLParser, feeding it glob strings and having it call
> convenient methods.
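>
> A rough sketch of that idea (all the names here are hypothetical):
>
>     import os, fnmatch
>
>     class SiteWalker:
>         """Walk a tree, calling handle_file for each matching file."""
>         def __init__(self, include='*', exclude=(), recurse=1):
>             self.include = include
>             self.exclude = exclude
>             self.recurse = recurse
>         def walk(self, top):
>             for name in os.listdir(top):
>                 path = os.path.join(top, name)
>                 if os.path.isdir(path):
>                     if self.recurse:
>                         self.walk(path)
>                 elif fnmatch.fnmatch(name, self.include):
>                     for pattern in self.exclude:
>                         if fnmatch.fnmatch(name, pattern):
>                             break
>                     else:
>                         self.handle_file(path)
>         def handle_file(self, path):
>             pass    # subclasses override this to do the real work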
>
> BTW, I'd try to avoid useless re-computation, as in putting
> ".lower()" on stuff that's already guaranteed lower case
> by previous code. And file = str(file) looks like a no-op
> in its context. Also, if you have a series of mutually
> exclusive if conditions, making the if's after the first
> into elif's will avoid executing tests after one succeeds.
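>
> I.e., something along these lines (the handler names are made up):
>
>     if file.endswith('.html'):
>         check_html(file)
>     elif file.endswith('.css'):
>         check_css(file)
>     elif file.endswith('.js'):
>         check_js(file)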
>
> BTW2, all those .endswith() tests would also succeed on file names that
> have no extension but happen to end with the given suffix. You could fix
> that, but Emile's version avoids it by splitting on '.', so you gain two
> ways ;-)
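>
> E.g., assuming the test was written without the dot:
>
>  >>> 'notes_on_html'.endswith('html')
>  1
>  >>> 'notes_on_html'.split('.')[-1] == 'html'
>  0
>  >>> 'index.html'.split('.')[-1] == 'html'
>  1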
>
> That enough for this go ;-)
>
> Regards,
> Bengt Richter



