Need Help with Programming Science Project
Peter Otten
__peter__ at web.de
Fri Jan 24 06:07:35 EST 2014
theguy wrote:
> I have a science project that involves designing a program which can
> examine a bit of text with the author's name given, then figure out who
> the author is if another piece of example text without the name is given.
> I so far have three different authors in the program and have already put
> in the example text but for some reason, the program always leans toward
> one specific author, Suzanne Collins, no matter what insane number I try
> to put in or how much I tinker with the coding. I would post the code, but
> I don't know if it's fine to put it here, as it contains pieces from
> books. I do believe that would go against copyright laws. If I can figure
> out a way to put it in without the bits from the stories, then I'll do so,
> but as of now, any help is appreciated. I understand I'm not exactly making
> it easy since I'm not putting up any code, but I'm kind of desperate
> for help here, as I can't seem to find anybody or anything else helpful
> in any way. Thank you.
If I were to speculate what your program might look like:
import random

# Known samples, keyed by author name.
text_samples = {
    "Suzanne Collins": "... some text by collins ...",
    "J. K. Rowling": "... some text by rowling ...",
    #...
}
unknown = "... sample text by unknown author ..."

def calc_match(text1, text2):
    # Placeholder: returns a random score in [0, 1).
    return random.random()

guessed_author = None
guessed_match = None
for author, text in text_samples.items():
    match = calc_match(unknown, text)
    print(author, match)
    # Keep the author with the highest score seen so far.
    if guessed_author is None or match > guessed_match:
        guessed_author = author
        guessed_match = match

print("The author is", guessed_author)
The important part of this script is not the text samples or the loop that
determines the best match -- it's the algorithm used to measure how well
two texts match.
In the above example that algorithm is encapsulated in the calc_match()
function, and it's really bad: it just returns random numbers between 0 and 1.
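Purely as an illustration of what a non-random calc_match() could look like,
here is a minimal sketch that compares word frequencies using cosine
similarity. This is my assumption, not your algorithm; the word_counts()
helper and the crude tokenizer are made up for the example.

```python
import math
import re
from collections import Counter


def word_counts(text):
    # Hypothetical helper: lowercase the text and count runs of letters
    # and apostrophes. A crude tokenizer, good enough for a sketch.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def calc_match(text1, text2):
    # Cosine similarity of the two word-count vectors:
    # 0.0 means no shared words, 1.0 means identical word proportions.
    counts1 = word_counts(text1)
    counts2 = word_counts(text2)
    # Counter returns 0 for words missing from counts2.
    dot = sum(n * counts2[word] for word, n in counts1.items())
    norm1 = math.sqrt(sum(n * n for n in counts1.values()))
    norm2 = math.sqrt(sum(n * n for n in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # at least one text contains no words
    return dot / (norm1 * norm2)
```

It can replace the random calc_match() in the script above without changing
the loop. Because the score is normalized by both vector lengths, a longer
sample does not score higher just for being longer.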
For us to help you, it should be sufficient to post the analog of this
function from your code, together with a plain-English description of how it
is meant to calculate the similarity between two texts.
Alternatively, instead of the copyrighted texts, grab text samples from
Project Gutenberg whose copyright has expired.
Make sure that the resulting post is as short as possible -- long text
samples don't make the post clearer than short ones.