Find Paths in log text - How to?

citizenkahn citizenkahn at gmail.com
Thu Mar 23 14:50:58 EST 2006


I am trying to parse a build log for errors.  I figure I can do this
one of three ways:
- find the absolute platonic form of an error and search for that item
- create definitions of what patterns describe errors for each tool
which is used (ant, MSDEV, etc).
- rework the build so that the return codes of all 3rd-party tools are
captured and logged using my own error description

The return code method has a high level of effort attached and spreads
the responsibility for the task quite widely, unless I can write a
little command-pattern-like wrapper script (which is a possibility).
Still, it would mean all of the 100s of calls to tools would have to be
wrapped, and if a developer bypassed the wrapper anywhere, the build
would leak errors.
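
As a rough illustration of the wrapper idea, here is a minimal sketch.
The name run_logged and the "BUILD-ERROR" marker are made up for this
example; it only assumes each tool signals failure with a nonzero
return code.

```python
import subprocess
import sys

def run_logged(cmd, log=sys.stderr):
    """Run cmd (a list of arguments); write one uniform line to log on failure."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # "BUILD-ERROR" is a hypothetical marker; anything easy to search for works.
        log.write("BUILD-ERROR rc=%d cmd=%r\n" % (result.returncode, cmd))
    return result.returncode
```

With something like this, the log parser only has to look for one
marker, but every tool invocation still has to go through the wrapper.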

The relativism/definition-based approach means that I must be sure to
capture all error cases, which may prove difficult, and false negatives
are a really dangerous problem.

In the Platonic/absolutist camp, I could define an error as an instance
of a word or phrase from the "bad list" that is not in a filename or
path.

Bad List: [error, fatal, killed, not found].
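
A minimal sketch of the bad-list check on its own, before any path
handling. The names BAD_WORDS and find_bad_words are hypothetical. It
matches substrings case-insensitively, so it errs toward false
positives ("terror.c" would match) rather than false negatives
("3 Errors" still matches).

```python
import re

BAD_WORDS = ["error", "fatal", "killed", "not found"]

# Case-insensitive substring match on any entry of the bad list.
BAD_RE = re.compile("|".join(re.escape(w) for w in BAD_WORDS), re.IGNORECASE)

def find_bad_words(text):
    """Return the bad-list entries found in text, lowercased, in order of appearance."""
    return [m.group(0).lower() for m in BAD_RE.finditer(text)]
```

This does nothing yet about words that sit inside a filename or path;
it would have to be applied only to the non-path parts of a line.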

Were I to go this way, I'd be faced with a major problem:  in a world
where symbols and whitespace can be included in a path how can I
extract a path from a line of text?

Ugly Valid Paths:
	C:\Program Files\A File Named Error .txt
	/usr/#a file named error #.txt


This means that determining the boundaries of a path is nontrivial.

FileNames:
	Many build tools list filenames without their full path.  All of my
	product's files are <text>.<ext>, so that is a pattern that I might
	be able to locate:
		.+\..+ perhaps
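
For comparison, a small sketch of how that pattern behaves next to a
tighter one. The tighter pattern is an assumption on my part: it only
works if file names contain no spaces or symbols other than "-" and
"_", and it will also pick up things like version numbers.

```python
import re

LOOSE = re.compile(r".+\..+")        # the pattern above
TIGHT = re.compile(r"[\w-]+\.\w+")   # assumed: no spaces or symbols in file names

line = "Compiling foo.cpp and bar.h"
loose_hits = LOOSE.findall(line)     # greedy: swallows the whole line
tight_hits = TIGHT.findall(line)     # picks out the two names separately
```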

Paths:
	on windows all of my paths will start with [A-Z]:\  or  \\
	on unix they will tend to start with ./ or /.
Finding the starting point is not too difficult, but it's the ending
that's hard.
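
A sketch of locating the starting point with one regular expression
covering the four start types. PATH_START and first_path_start are
hypothetical names, and I have allowed lowercase drive letters too. A
bare "/" will also match things that are not paths (compiler switches,
"and/or"), so this only yields candidates.

```python
import re

# One alternative per start type; VERBOSE mode ignores the layout whitespace.
PATH_START = re.compile(r"""
      [A-Za-z]:\\      # drive letter, colon, one backslash
    | \\\\             # UNC prefix: two backslashes
    | \./              # relative to the current directory
    | /                # rooted unix path
""", re.VERBOSE)

def first_path_start(line):
    """Return the index where the first candidate path begins, or -1."""
    m = PATH_START.search(line)
    return m.start() if m else -1
```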


I could generate a substring for each of the starting types and then
look at what came before.
	for sep in [letterStart, uncStart, unixrootedStart, unixpwdStart]:
		# text before the first occurrence of this path start
		prePath = line.split(sep)[0]
		checkForBadWords(prePath)

I could then split the postPath segment on os.sep and then check for
unlikely cases in the list elements
- double spaces within a path element
- symbol characters within an element (although this is a little dicey)
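
The two checks above might look something like this. It is only a
sketch: suspicious_elements is a made-up name, and the set of "safe"
characters (word characters, space, ".", "-", ":") is an assumption
that would need tuning.

```python
import os
import re

def suspicious_elements(candidate, sep=os.sep):
    """Return the path elements that look unlikely to belong to a real path."""
    flagged = []
    for element in candidate.split(sep):
        if "  " in element:
            # double space inside a path element
            flagged.append(element)
        elif re.search(r"[^\w .\-:]", element):
            # symbol outside the assumed-safe character set
            flagged.append(element)
    return flagged
```

Note that the second ugly path above gets flagged because of its "#"
characters, which is exactly why the symbol check is dicey.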

Since I am parsing the log on the system on which it was generated, for
each path I could do an os.path.exists on the potential path.
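
One way to use os.path.exists to find the ending, sketched under the
assumption that the text passed in already begins at a path start:
try the longest candidate first and shrink it one character at a time
until something exists. The function name is hypothetical.

```python
import os

def longest_existing_path(text):
    """Return the longest prefix of text that names an existing path, or None."""
    for end in range(len(text), 0, -1):
        candidate = text[:end]
        if os.path.exists(candidate):
            return candidate
    return None
```

This costs one filesystem check per character, and it cannot help with
files that were deleted or never created (which may be the very thing
the error is about), so it would be one signal among several.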

If someone happens to know of a good method of extracting weird paths
out of logs, I'd be interested in hearing about it.



