Need Help with Programming Science Project

Sat Jan 25 06:31:08 EST 2014

On Fri, 24 Jan 2014 20:58:50 -0800, theguy wrote:

> I know. I'm kind of ashamed of the code, but it does the job I need it
> to up to a certain point

OK, well first of all take a step back and look at the problem.

You have n exemplars, each from a known author.

You analyse each exemplar, and determine some statistics for it.

You then take your unknown sample and determine the same statistics for 
it.

Finally, you compare each exemplar's stats with the sample's stats to try 
and find a best match.

So perhaps you want a dictionary of { author: statistics }, and a 
function to analyse a piece of text. That function might call other 
functions to get e.g. average words per sentence, average letters per 
sentence, average word length, the standard deviation of each, the short 
word ratio (words <= 3 chars vs words >= 4 chars) and some other 
statistics.
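
A rough Python 3 sketch of what such an analysis function could look 
like. The function names, the regex-based sentence and word splitting, 
and the choice of population standard deviation are all assumptions made 
for illustration; real text needs more careful tokenising, and the text 
is assumed to contain at least one sentence and one word.

```python
import re
import statistics


def split_sentences(text):
    # Naive assumption: a sentence ends at one or more of '.', '!' or '?'
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(text):
    # Naive assumption: a word is a run of ASCII letters or apostrophes
    return re.findall(r"[A-Za-z']+", text)


def text_stats(text):
    """Return a tuple of statistics for a block of text."""
    sentences = split_sentences(text)
    words = split_words(text)
    words_per_sentence = [len(split_words(s)) for s in sentences]
    word_lengths = [len(w) for w in words]
    short = sum(1 for n in word_lengths if n <= 3)   # words of <= 3 chars
    longer = len(words) - short                      # words of >= 4 chars
    return (
        statistics.mean(words_per_sentence),
        statistics.pstdev(words_per_sentence),
        statistics.mean(word_lengths),
        statistics.pstdev(word_lengths),
        short / longer if longer else float("inf"),
    )
```

Calling text_stats() on each exemplar and on the unknown sample gives 
tuples whose positions line up, which is what makes them comparable.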

Given the statistics for each exemplar, you might store these in your 
dictionary as a tuple.

This isn't Python, it's a description of an algorithm; it just looks a 
bit Pythonic:

# tuple of weightings applied to different stats
stat_weightings = ( 1.0, 1.3, 0.85, ...... )

def get_some_stat( t ):
	# calculate some numerical statistic on a block of text
	# return it

def analyse( f ):
	text = read_file( f )
	return ( get_some_stat( text ), ...... )

exemplar_data = {}

# exemplar_files maps each author to the file holding their exemplar
for author in keys( exemplar_files ):
	exemplar_data[ author ] = analyse( exemplar_files[ author ] )

sample_data = analyse( sample_file )

scores = {}

x = 0

# score for a piece of work is sum of ( abs diff of stat * weighting )
# for all the stats, lower score = closer match
for author in keys( exemplar_data ):
	# start each author's score from zero
	tmp = 0
	for i in range( len( exemplar_data[ author ] ) ):
		tmp = tmp + abs( exemplar_data[ author ][ i ] -
			sample_data[ i ] ) * stat_weightings[ i ]
	scores[ author ] = tmp
	# remember the largest score as the starting point for the search below
	if tmp > x:
		x = tmp

names = []

for author in keys( scores ):
	if scores[ author ] < x:
		x = scores[ author ]
		names = [ author ]
	elif scores[ author ] == x:
		names.append( author )

print "the best matching author(s) is/are: ", names

Then all you have to do is find enough ways to calculate stats, and work 
out the magic coefficients to use in stat_weightings.
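
For comparison, a minimal runnable Python 3 sketch of the scoring and 
best-match steps. It assumes the score is a weighted sum of absolute 
differences (abs() is an assumption here, chosen so that negative 
differences are handled), and the author names and numbers are made up 
purely for illustration.

```python
def score(exemplar_stats, sample_stats, weightings):
    # Weighted sum of absolute differences; lower means a closer match.
    return sum(
        abs(e - s) * w
        for e, s, w in zip(exemplar_stats, sample_stats, weightings)
    )


def best_matches(exemplar_data, sample_stats, weightings):
    # Score every author, then keep all authors tied on the lowest score.
    # Assumes exemplar_data is not empty.
    scores = {
        author: score(stats, sample_stats, weightings)
        for author, stats in exemplar_data.items()
    }
    lowest = min(scores.values())
    names = [author for author, s in scores.items() if s == lowest]
    return names, scores


# Hypothetical data: three statistics per text, one weighting per statistic
exemplar_data = {
    "author_a": (20.0, 4.5, 0.8),
    "author_b": (12.0, 4.0, 1.2),
}
sample_stats = (13.0, 4.1, 1.1)
stat_weightings = (1.0, 1.3, 0.85)

names, scores = best_matches(exemplar_data, sample_stats, stat_weightings)
print("the best matching author(s) is/are:", names)
```

Using min() over the scores avoids having to seed the search with the 
largest score first, and zip() keeps each statistic paired with its 
weighting without indexing.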

-- 
Denis McMahon, denismfmcmahon at gmail.com