a website information gathering script
Scott Hathaway
scott.l.hathaway at lmco.com
Tue Mar 26 11:45:00 EST 2002
Thanks to both of you for your help! I will improve the script when I have some time.
Scott
Bengt Richter wrote:
> On Fri, 22 Mar 2002 07:39:06 -0600, Scott Hathaway <scott.l.hathaway at lmco.com> wrote:
>
> >I have a simple python script which gathers information about a website
> >and produces an html report about what it finds. The way it works is
> >very clunky and I would appreciate some feedback and help improving it.
> >
> >http://www.hcsprogramming.com/downloads/index.html
> >
> Unless you know that the format of your files is more restricted than their
> types require, your criterion for finding a file reference is going to
> get spurious hits (of course the other side is you can't statically find
> strings that don't exist until run time either, when it comes to program sources).
>
> E.g., for HTML you could do a much better job by using the HTMLParser module.
> UIAM it also normalizes things to lower case for you, besides letting you
> skip commented-out references and similar character sequences in
> the data part. And since the parser makes it easy to get hrefs, you could
> split and check #-references and definitions too.
>
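A minimal sketch of the HTMLParser suggestion above, not taken from the original script. It assumes Python 3, where the module lives at html.parser (in 2002 it was the top-level HTMLParser module); the class name LinkCollector and the sample page are made up for illustration. It shows that tag and attribute names arrive lower-cased, that commented-out markup is never seen as a tag, and that '#' references can be split off from the href.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href/src references and named anchors from one HTML page."""

    def __init__(self):
        super().__init__()
        self.references = []   # (target, fragment) pairs from href/src attributes
        self.anchors = set()   # names that '#fragment' references may point to

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag and attribute names already lower-cased.
        for name, value in attrs:
            if value is None:
                continue
            if name in ('href', 'src'):
                target, _, fragment = value.partition('#')
                self.references.append((target, fragment))
            elif name == 'id' or (tag == 'a' and name == 'name'):
                self.anchors.add(value)


# Hypothetical page: one commented-out link, two live references, one anchor.
page = '''<html><body>
<!-- <a href="old.html">commented out</a> -->
<A HREF="page2.html#top">next</A>
<img SRC="logo.gif">
<a name="top"></a>
</body></html>'''

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.references)
print(sorted(collector.anchors))
```

The commented-out old.html link does not show up in collector.references, which is the kind of spurious hit a plain substring search would report.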
> The other files should really be parsed too, if you want to avoid spurious
> matches.
>
> As Emile pointed out, concatenating strings with '+' is relatively expensive,
> and one way of doing it more cheaply is to put the pieces in a list and later
> join them. You can do that in a style that looks similar to what you did, but
> is a lot more efficient.
>
> Note that strings separated by white space are concatenated by the compiler,
> and in a () or [] context that white space can include EOLs, so you can mix
> string fragments and expressions e.g., like this:
>
> >>> x= '<example>'
> >>> a = [
> ... '123'
> ... '456'
> ... '\n', x*2 , '\n'
> ... 'abc'
> ... 'def'
> ... ]
> >>> a
> ['123456\n', '<example><example>', '\nabcdef']
>
> Notice that 'a' wound up a list of _3_ strings here,
> and note the commas isolating the x*2 expression.
>
> >>> print ''.join(a)
> 123456
> <example><example>
> abcdef
> >>> a += ['\nYou can add this way too, for ', x[1:-1], '.']
> >>> print ''.join(a)
> 123456
> <example><example>
> abcdef
> You can add this way too, for example.
> >>> a += [
> ... '\nOr'
> ... ' this'
> ... ' way'
> ... '\n(noting that indent is ignored inside () or [].' ]
> >>> print ''.join(a)
> 123456
> <example><example>
> abcdef
> You can add this way too, for example.
> Or this way
> (noting that indent is ignored inside () or [].
> >>>
>
> Of course, you can arrange pieces to suit, but the point
> is that it needn't be expensive to write a mix of string fragments
> and expressions in a sequence, and you can lay it out in
> multiple lines of source without a += 'next string piece' expense
> at every line (noting that the similar looking
> a += ['a string ',foo(),' more']
> is comparatively cheap).
>
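The list-and-join style described above could be applied to a report roughly like this. It is only a sketch under assumptions: build_report, its parameters, and the row data are hypothetical and not from the original script, and it is written for Python 3.

```python
def build_report(title, rows):
    """Return an HTML report built from a list of pieces joined once."""
    # Adjacent string literals are merged at compile time; the commas
    # isolate the expressions, exactly as in the session above.
    parts = [
        '<html><head><title>', title, '</title></head>\n'
        '<body>\n'
        '<h1>', title, '</h1>\n'
        '<ul>\n'
    ]
    for name, size in rows:
        # Extending the list is cheap compared with growing one big string.
        parts += ['<li>', name, ' (', str(size), ' bytes)</li>\n']
    parts.append('</ul>\n</body></html>\n')
    return ''.join(parts)


report = build_report('Site', [('index.html', 120), ('logo.gif', 2048)])
print(report)
```

The string is assembled once, at the final join, however many rows are added.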
> Triple quotes of course are good for defining lines without using '\n',
> and you can plug in pieces symbolically from a dictionary:
>
> >>> b = """<h2>%(bigText)s</h2>
> ... <font color="red">%(redText)s</font>
> ... etc...
> ... """
> >>> stuff = {'bigText':'Should be big', 'redText':'Should be red'}
> >>> b % stuff
> '<h2>Should be big</h2>\n<font color="red">Should be red</font>\netc...\n'
> >>> print b % stuff
> <h2>Should be big</h2>
> <font color="red">Should be red</font>
> etc...
>
> You can also use vars() in place of stuff to pick up symbols
> from the local or other dictionaries.
>
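A small sketch of the vars() variant mentioned above, assuming Python 3. The function name render is hypothetical; the template keys are the ones from the session above.

```python
template = """<h2>%(bigText)s</h2>
<font color="red">%(redText)s</font>
"""


def render(bigText, redText):
    # vars() with no argument returns the local namespace as a dictionary,
    # so the parameter names double as the template keys.
    return template % vars()


html = render('Should be big', 'Should be red')
print(html)
```

Extra names in the dictionary are ignored by the % operator, so only the keys the template mentions need to exist.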
> You might also want to consider using dictionaries for some
> sets of things instead of lists.
>
> To enhance the directory walker, you could give it globbing
> and excluding and optional recursion controls. Perhaps make
> it a class meant to be subclassed like HTMLParser, feeding it
> glob strings and having it call convenient methods.
>
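One possible shape for such a walker, as a sketch only. The names DirWalker, Lister, wanted, and handle_file are invented here, and os.walk (added to Python after this thread) is assumed rather than whatever the original script used. Glob patterns are matched with fnmatch against the bare file name.

```python
import fnmatch
import os
import tempfile


class DirWalker:
    """Walk a directory tree and hand matching files to handle_file()."""

    def __init__(self, include=('*',), exclude=(), recurse=True):
        self.include = include
        self.exclude = exclude
        self.recurse = recurse

    def wanted(self, filename):
        # Exclusion patterns win over inclusion patterns.
        if any(fnmatch.fnmatch(filename, pat) for pat in self.exclude):
            return False
        return any(fnmatch.fnmatch(filename, pat) for pat in self.include)

    def walk(self, top):
        for dirpath, dirnames, filenames in os.walk(top):
            if not self.recurse:
                dirnames[:] = []  # pruning in place stops os.walk descending
            for filename in sorted(filenames):
                if self.wanted(filename):
                    self.handle_file(os.path.join(dirpath, filename))

    def handle_file(self, path):
        """Override in a subclass; the default does nothing."""


class Lister(DirWalker):
    """Example subclass that just remembers the paths it is given."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.found = []

    def handle_file(self, path):
        self.found.append(path)


# Try it on a throwaway tree so the example does not depend on real files.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'sub'))
    for rel in ('index.html', 'notes.txt',
                os.path.join('sub', 'page.html'),
                os.path.join('sub', 'page.html.bak')):
        open(os.path.join(top, rel), 'w').close()
    lister = Lister(include=('*.html', '*.bak'), exclude=('*.bak',))
    lister.walk(top)
    found = sorted(os.path.relpath(p, top) for p in lister.found)
print(found)
```

Passing recurse=False would limit the walk to the top directory; a report generator would override handle_file instead of collecting paths.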
> BTW, I'd try to avoid useless re-computation, as in putting
> ".lower()" on stuff that's already guaranteed lower case
> by previous code. And file = str(file) looks like a no-op
> in its context. Also, if you have a series of mutually
> exclusive if conditions, making the if's after the first
> into elif's will avoid executing tests after one succeeds.
>
> BTW2, all those .endswith() tests would succeed on file names
> without extensions but having the given suffix. You could fix
> it, but Emile's version avoids that by splitting on '.' so you
> gain two ways ;-)
>
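To illustrate the .endswith() pitfall: the sketch below uses os.path.splitext rather than Emile's split on '.', which is not reproduced in this message, so treat it as a swapped-in alternative. The set HTML_EXTENSIONS and the function is_html are hypothetical names, and the code assumes Python 3.

```python
import os.path

HTML_EXTENSIONS = {'.htm', '.html'}   # hypothetical set, for illustration


def is_html(filename):
    # splitext() only returns an extension when there is a real '.' separator,
    # and a single lower() call handles mixed-case names.
    ext = os.path.splitext(filename)[1].lower()
    return ext in HTML_EXTENSIONS


print('nothtml'.endswith('html'), is_html('nothtml'))
print(is_html('Index.HTML'))
```

The first line prints True for the suffix test but False for the extension test, since 'nothtml' has no extension at all.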
> That's enough for this go ;-)
>
> Regards,
> Bengt Richter