Need Help with Programming Science Project

Fri Jan 24 06:07:35 EST 2014

theguy wrote:

> I have a science project that involves designing a program which can
> examine a bit of text with the author's name given, then figure out who
> the author is if another piece of example text without the name is given.
> I so far have three different authors in the program and have already put
> in the example text but for some reason, the program always leans toward
> one specific author, Suzanne Collins, no matter what insane number I try
> to put in or how much I tinker with the coding. I would post the code, but
> I don't know if it's fine to put it here, as it contains pieces from
> books. I do believe that would go against copyright laws. If I can figure
> out a way to put it in without the bits from the stories, then I'll do so,
> but as of now, any help is appreciated. I understand I'm not exactly
> making it easy since I'm not putting up any code, but I'm kind of desperate
> for help here, as I can't seem to find anybody or anything else helpful
> in any way. Thank you.

If I were to speculate what your program might look like:

text_samples = {
    "Suzanne Collins": "... some text by collins ...",
    "J. K. Rowling": "... some text by rowling ...",
    #...
}

unknown = "... sample text by unknown author ..."

import random

def calc_match(text1, text2):
    # Placeholder: ignores both texts and returns a random score in [0, 1).
    return random.random()

guessed_author = None
guessed_match = None

for author, text in text_samples.items():
   match = calc_match(unknown, text)
   print(author, match)
   if guessed_author is None or match > guessed_match:
       guessed_author = author
       guessed_match = match

print("The author is", guessed_author)

The important part of this script is not the text samples or the loop that
picks the best match; it is the algorithm that determines how well two
texts match.
In the example above that algorithm is encapsulated in the calc_match()
function, and it's really bad: it ignores both texts and just returns a
random number between 0 and 1.
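
For illustration only, here is what a non-random calc_match() might look
like. This is a minimal sketch, not the original poster's algorithm (which
was never shown): the helper word_frequencies() and the choice of cosine
similarity over word counts are assumptions made up for this example. The
score is normalized, so a longer sample does not score higher just because
it contains more words.

```python
import math
import re
from collections import Counter

def word_frequencies(text):
    # Hypothetical helper: lowercase the text and count each word,
    # ignoring punctuation and digits.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def calc_match(text1, text2):
    # Cosine similarity of the two word-count vectors:
    # 0.0 means no words in common, 1.0 means identical word proportions.
    freq1 = word_frequencies(text1)
    freq2 = word_frequencies(text2)
    # A Counter returns 0 for words it has not seen.
    dot = sum(count * freq2[word] for word, count in freq1.items())
    norm1 = math.sqrt(sum(c * c for c in freq1.values()))
    norm2 = math.sqrt(sum(c * c for c in freq2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # at least one text contains no words
    return dot / (norm1 * norm2)
```

It takes the same two arguments as the random version, so it can be dropped
into the loop above unchanged.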

For us to help you, it should be enough to post the equivalent of this
function from your code, together with a plain-English description of how it
is meant to calculate the similarity between two texts.

Alternatively, instead of the copyrighted texts, grab text samples whose
copyright has expired from Project Gutenberg.

Make sure that the resulting post is as short as possible -- long text 
samples don't make the post clearer than short ones.